Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

path: add path.glob #47490

Closed
wants to merge 3 commits into from
Closed

path: add path.glob #47490

wants to merge 3 commits into from

Conversation

MoLow
Copy link
Member

@MoLow MoLow commented Apr 9, 2023

an alternative for #47486 (suggested at #40731 (comment))

as my experience in cpp is much more limited than in js - this code probably has many issues, so feel free to comment on anything.

the glob implementation is used from the linux kernel, it complies with https://man7.org/linux/man-pages/man7/glob.7.html, and is licenced under MIT

this implementation lacks some features that minimatch has such as brace expansion, but according to my tests Python comes with a similar implementation: https://docs.python.org/3/library/glob.html

@nodejs-github-bot
Copy link
Collaborator

Review requested:

  • @nodejs/gyp

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run. labels Apr 9, 2023
@MoLow
Copy link
Member Author

MoLow commented Apr 9, 2023

CC @anonrig WDYT?

@targos
Copy link
Member

targos commented Apr 9, 2023

You have to be careful as both bash and Linux are under the GPL license

@bnoordhuis
Copy link
Member

This can't be merged as-is for the reason @targos mentions. I suggest closing this and starting over from a clean slate.

@anonrig
Copy link
Member

anonrig commented Apr 9, 2023

@MoLow I think this is a better approach than the previous one. Thank you for valuing my opinion. Happy help on the C++ side.

@MoLow
Copy link
Member Author

MoLow commented Apr 9, 2023

@targos @bnoordhuis according to the source code this specific module is licenced under MIT as well:
https://github.com/torvalds/linux/blob/cdc9718d5e590d6905361800b938b93f2b66818e/lib/glob.c#L10

@MoLow MoLow force-pushed the add-path-glob branch 2 times, most recently from bdbd172 to 9ae0a18 Compare April 9, 2023 13:53
@MoLow MoLow force-pushed the add-path-glob branch 3 times, most recently from 207625e to f21664c Compare April 9, 2023 17:02
src/node_path.cc Show resolved Hide resolved
src/node_path.cc Show resolved Hide resolved
* it against the remaining unmatched tail of str. Return false
* on mismatch, or true after matching the trailing nul bytes.
*/
for (;;) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: There is no need to iterate each character. input.find_first_of("?*[") until the input is finished might produce better results if performance is the priority. If not, we can keep it as it is, and later improve it. Your call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand these comments - is there really motivation in diverging from the implementation lifted from glob rather than benefit from its updates?

If we have improvement suggestions wouldn't it be better to upstream them to linux glob?

(genuinely asking, not sure it's. just not really intuitive)

unsigned char d = *pat++;

switch (d) {
case '?': /* Wildcard: anything but nul */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This switch case can be removed and be simplified for better performance.

src/node_path.h Outdated Show resolved Hide resolved
Copy link
Member

@anonrig anonrig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update documentation too?

@MoLow MoLow added semver-minor PRs that contain new features and should be released in the next minor version. notable-change PRs with changes that should be highlighted in changelogs. labels Apr 9, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Apr 9, 2023

The notable-change PRs with changes that should be highlighted in changelogs. label has been added by @MoLow.

Please suggest a text for the release notes if you'd like to include a more detailed summary, then proceed to update the PR description with the text or a link to the notable change suggested text comment.

@MoLow MoLow added the commit-queue-squash Add this label to instruct the Commit Queue to squash all the PR commits into the first one. label Apr 9, 2023
Copy link
Contributor

@cjihrig cjihrig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Left some comments. There are also some things about the licensing I'm unsure about.

doc/api/path.md Outdated Show resolved Hide resolved
@@ -52,6 +52,7 @@
V(mksnapshot) \
V(options) \
V(os) \
V(path) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a strong opinion, but maybe we should name the binding and file glob instead of path?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjihrig I'm planning on moving a couple of functions to C++ (actually done the same exact change), so this change is beneficial for me too.

test/parallel/test-path-glob.mjs Outdated Show resolved Hide resolved
test/parallel/test-path-glob.mjs Outdated Show resolved Hide resolved
Comment on lines +2170 to +2173
Usage-Guide:
To use the MIT License put the following SPDX tag/value pair into a
comment according to the placement guidelines in the licensing rules
documentation:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we following this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are instructions for maintainers of the Linux kernel:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/LICENSES/preferred/MIT?h=v6.3-rc6
I am not really sure how we should embed this since it only really appears here
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/lib/glob.c?h=v6.3-rc6#n10

see also the comment on top of #define MODULE_LICENSE:

/*
 * The following license idents are currently accepted as indicating free
 * software modules
 *
 *	"GPL"				[GNU Public License v2]
 *	"GPL v2"			[GNU Public License v2]
 *	"GPL and additional rights"	[GNU Public License v2 rights and more]
 *	"Dual BSD/GPL"			[GNU Public License v2
 *					 or BSD license choice]
 *	"Dual MIT/GPL"			[GNU Public License v2
 *					 or MIT license choice]
 *	"Dual MPL/GPL"			[GNU Public License v2
 *					 or Mozilla license choice]
 *
 * The following other idents are available
 *
 *	"Proprietary"			[Non free products]
 *
 * Both "GPL v2" and "GPL" (the latter also in dual licensed strings) are
 * merely stating that the module is licensed under the GPL v2, but are not
 * telling whether "GPL v2 only" or "GPL v2 or later". The reason why there
 * are two variants is a historic and failed attempt to convey more
 * information in the MODULE_LICENSE string. For module loading the
 * "only/or later" distinction is completely irrelevant and does neither
 * replace the proper license identifiers in the corresponding source file
 * nor amends them in any way. The sole purpose is to make the
 * 'Proprietary' flagging work and to refuse to bind symbols which are
 * exported with EXPORT_SYMBOL_GPL when a non free module is loaded.
 *
 * In the same way "BSD" is not a clear license information. It merely
 * states, that the module is licensed under one of the compatible BSD
 * license variants. The detailed and correct license information is again
 * to be found in the corresponding source files.
 *
 * There are dual licensed components, but when running with Linux it is the
 * GPL that is relevant so this is a non issue. Similarly LGPL linked with GPL
 * is a GPL combined work.
 *
 * This exists for several reasons
 * 1.	So modinfo can show license info for users wanting to vet their setup
 *	is free
 * 2.	So the community can ignore bug reports including proprietary modules
 * 3.	So vendors can do likewise based on their own policies
 */

Copy link
Member Author

@MoLow MoLow Apr 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did what I best understood from what I found in the Linux kernel and in other license examples so if anyone can help out with what really should be used as the license - I'd appreciate

LICENSE Show resolved Hide resolved
Copy link
Member

@anonrig anonrig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with the current implementation. Some typos that @cjihrig needs to be addressed, but other than that I think this is really good work. Thanks @MoLow

@MoLow
Copy link
Member Author

MoLow commented Apr 10, 2023

I have played around with this a bit, and the globbing algorithm is much more limited than I have thought, I am looking into other alternatives

@benjamingr
Copy link
Member

@Trott @joesepi (pinging because you're our OpenJS CPC delegates), can either of you help with getting in touch with legal from the Linux Foundation to make sure the license was understood correctly and we are in compliance?

I remember that once upon a time in the Node.js foundation days that's something that wasn't too hard to do and we'd all like to avoid potential issues with this.

Copy link
Member

@benjamingr benjamingr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly personally prefer a plain JS implementation if the performance delta isn't big because it's more obviously secure and we haven't seen a benchmark to indicate this is actually faster at all. I'm also ignorant about what use cases linux glob was optimized for.

That said - this looks correct and I'm fine with it landing with green CI and we can always switch the implementation later.

@Trott
Copy link
Member

Trott commented Apr 10, 2023

@Trott @joesepi (pinging because you're our OpenJS CPC delegates), can either of you help with getting in touch with legal from the Linux Foundation to make sure the license was understood correctly and we are in compliance?

Yes, I'll try to loop in the appropriate Foundation folks.

@Trott
Copy link
Member

Trott commented Apr 10, 2023

@Trott @joesepi (pinging because you're our OpenJS CPC delegates), can either of you help with getting in touch with legal from the Linux Foundation to make sure the license was understood correctly and we are in compliance?

@rginn

@tniessen
Copy link
Member

this implementation lacks some features that minimatch has such as brace expansion

Given that the original feature request (#40731) as well as the previous PR (#47486) specifically targeted minimatch/glob, could you be more specific as to the differences, especially with respect to different operating systems? This part of @isaacs' #40731 (comment) sounds like a proper implementation would be rather complicated:

If you only need a subset of the bash globbing behavior, and especially if you don't need hot path performance or windows support, then a couple hundred line js module can definitely suffice. But I would not recommend pulling such a glob subset into core, unless the node team is interested in a never ending stream of bug reports for every edge case where it'll diverge from shell behavior. Having an incorrect or incomplete glob module in core is arguably worse than not having one at all.

@isaacs
Copy link
Contributor

isaacs commented Apr 10, 2023

Do not assume that any particular glob implementation in C will be faster. The pattern expansion implementation found in bash and zsh is likely to have serious performance issues in node. Shell expansion is optimized for intuitive shell UX and correct behavior in low-memory systems, and will likely be a footgun if anyone puts it to serious use in a node program at runtime. Node programs need to go fast, and they're memory hogs, it's the exact opposite of what Bash's glob is designed for.

I have benchmarked node-glob, fast-glob, and globby extensively, and the source of latency is almost entirely the overhead of accessing the filesystem, with GC cleanup a distant second. The bash implementation (at least last I checked) is an absolute glutton with syscalls, caching almost nothing, walking directories multiple times, etc. And that's fine, because that's what it's for. It's built to be intuitive, consistent, and efficient with memory usage, not to be fast. There's even a note in man bash warning that it's slow if you do anything too complicated with it. (I commented out the bash tests from my glob benchmark, because they're too slow to be relevant. Like 100 to 10,000 times slower in many fairly straightforward cases. I only have them there to verify correctness of the returned results.)

Just to caveat this, of course, I haven't closely investigated bash's glob it in a few years, since a few years ago when bash 5.0 came out, but I doubt that much has changed, since the intended use case is still the same, and most potential improvements to performance would involve breaking changes in behavior. The original implementation of node-glob was a binding to Guido van Rossum's libglob in C, and I rewrote it in JS largely to make it go faster, as well as to add globstar, extglob, and brace expansion.

Also, as it seems @MoLow is finding, what we colloquially think of as "glob" is actually a few different things: posix regular expression classes, globstar, extended glob patterns, brace expansion, variable string expansion, and then file path portion expansion. If you pull in just that last part, but don't support any of the rest, it's going to be very underwhelming.

As I said previously:

I would not recommend pulling such a glob subset into core, unless the node team is interested in a never ending stream of bug reports for every edge case where it'll diverge from shell behavior. Having an incorrect or incomplete glob module in core is arguably worse than not having one at all.

@MoLow
Copy link
Member Author

MoLow commented Apr 10, 2023

@isaacs Thank you very much for your detailed comment!

Also, as it seems @MoLow is finding, what we colloquially think of as "glob" is actually a few different things: POSIX regular expression classes, globstar, extended glob patterns, brace expansion, variable string expansion, and then file path portion expansion. If you pull in just that last part, but don't support any of the rest, it's going to be very underwhelming.

yes, I tried adding more tests from the minimatch repo and learned that this PR includes a very very small part of what I really needed and expected from glob

As I said previously:

I would not recommend pulling such a glob subset into core, unless the node team is interested in a never-ending stream of bug reports for every edge case where it'll diverge from shell behavior. Having an incorrect or incomplete glob module in core is arguably worse than not having one at all.

the reason why we want glob in the core is for features in core such as node --test <pattern> or node --watch <pattern>
this leaves us with the option of adding minimatch or glob as a dependency and not exposing it at all (partof what was done at #47486).

@isaacs
Copy link
Contributor

isaacs commented Apr 10, 2023

If you really want to have globbing built into core, the reasonable thing to do is to pull in either fast-glob or node-glob. They're both "big" complex modules that do a lot of stuff and have a bunch of optimizations and moving parts, around 5-10 kLoC each, but they're extremely well tested, complete, and performant. The choice depends ultimately on what kind of features/behavior/performance are desired. Writing a brand new clean-room implementation is an option, but probably the costliest option possible, and given that fast-glob and node-glob evolved to be so similar to one another just by following the path of optimization, it's reasonable to assume that you'll end up with a very similar implementation anyway. Skip the effort, get the good result.

Also: expanding globs to a set of filesystem entries should really live in fs, not path, since it's inherently a filesystem walk. But it would make sense for minimatch (or whatever you land on for doing that part of the job) live in path, since that's just a string processing operation with no filesystem involvement.

Along the way, you'll need a comprehensive file-system walking implementation. (Bash's glob does not have one. It caches nothing and just keeps recursing until done. It's the slowest possible way to tackle this problem, but also the most straightforward and memory efficient, and the approach node-glob used to use.) Fast-glob uses @nodelib/fs.walk for this, node-glob uses path-scurry. May as well expose that as an fs.walk() method, as that's another common thing people need, and there's perhaps room for some lowlevel optimizations that would make node-glob and fast-glob even faster if they could take advantage of it.

@isaacs
Copy link
Contributor

isaacs commented Apr 10, 2023

this leaves us with the option of adding minimatch or glob as a dependency and not exposing it at all (partof what was done at #47486).

Why not pull it in as a dep and also expose it? (Not challenging, genuinely curious.)

const { minimatch } = require('node:internal/lib/minimatch.js')
path.match = (pattern, pathLike) => minimatch.match(pattern, pathLike)

const { glob, globSync, globIterate, globIterateSync } = require('node:internal/lib/glob.js')
fs.glob = (pattern, options, cb) => {
  glob(pattern, options).then(res => cb(null, res), cb)
}
fs.globSync = globSync
fs.promises.glob = glob
// ...

Like, ok, it's a lot of code, but you'll end up writing more or less the same code anyway to implement it (or wish you had, if you ship bash's globs lol)


edit: Oh, I see, something to do with not using Primordials. I'm not sure what those are 😅

@MoLow
Copy link
Member Author

MoLow commented Apr 10, 2023

Why not pull it in as a dep and also expose it? (Not challenging, genuinely curious.)

there were concerns raised about that #47486 (comment)
just a few issues with exposing it is it won't use built-in node errors or primordials,
changing the API won't necessarily respect node semver policy, etc etc.

@MoLow
Copy link
Member Author

MoLow commented Apr 10, 2023

Oh, I see, something to do with not using Primordials. I'm not sure what those are 😅

@isaacs primordials are basically a frozen reference to js built-ins to avoid them being tamperd
https://github.com/nodejs/node/blob/c6dabe30832e40dab7fa35fd2ccd0135c74ca091/doc/contributing/primordials.md

@isaacs
Copy link
Contributor

isaacs commented Apr 10, 2023

yes, I tried adding more tests from the minimatch repo and learned that this PR includes a very very small part of what I really needed and expected from glob

Also, take a look at the benchmark scripts in https://github.com/isaacs/node-glob. I don't think a node core implementation needs to necessarily be faster than fast-glob or node-glob, but it should be within that range to not be a hazard if it's exposed at all. node:test's usage is likely to be small enough to never be a huge issue, but I would bet anything that the second it lands, people are going to be asking for it to be exposed for their programs as well.

@isaacs
Copy link
Contributor

isaacs commented Apr 10, 2023

Oh, I see, something to do with not using Primordials. I'm not sure what those are 😅

@isaacs primordials are basically a frozen reference to js built-ins to avoid them being tamperd https://github.com/nodejs/node/blob/c6dabe30832e40dab7fa35fd2ccd0135c74ca091/doc/contributing/primordials.md

Ah, nice. Yeah, porting a module as involved as glob (+path-scurry, +minimatch, etc.) would be quite a bit of work, and likely make it painful to pull future patches.

Maybe it'd be possible to "mimic" some of that hardening in the library itself (or in a standalone userland module), and then just swap out import { ... } from './primordials.js' with the equivalent import from node:lib/internal/per_context/primordials.js? If it doesn't impact performance, I'd be open to it.

@MoLow MoLow closed this Apr 10, 2023
@MoLow MoLow deleted the add-path-glob branch April 10, 2023 22:29
@MoLow
Copy link
Member Author

MoLow commented Apr 10, 2023

Along the way, you'll need a comprehensive file-system walking implementation. (Bash's glob does not have one. It caches nothing and just keeps recursing until done. It's the slowest possible way to tackle this problem, but also the most straightforward and memory efficient, and the approach node-glob used to use.) Fast-glob uses @nodelib/fs.walk for this, node-glob uses path-scurry. May as well expose that as an fs.walk() method, as that's another common thing people need, and there's perhaps room for some lowlevel optimizations that would make node-glob and fast-glob even faster if they could take advantage of it.

@isaacs If I understand correctly there is an effort to do something like this: #41439

@isaacs
Copy link
Contributor

isaacs commented Apr 10, 2023

Oh, neat.

That really needs a streaming API (ideally not a Stream per se, but a low level "get the next thing" kinda like the opendir/scandir stuff) as well as a way to filter whether a directory gets descended into or not, or it won't be of much use for a glob walker. Eg: https://github.com/isaacs/node-glob/blob/main/src/walker.ts#L238-L309

MoLow added a commit to MoLow/node that referenced this pull request Apr 21, 2023
PR-URL: nodejs#47499
Refs: nodejs#47490
Refs: nodejs#47486
Reviewed-By: Marco Ippolito <[email protected]>
Reviewed-By: Robert Nagy <[email protected]>
targos pushed a commit that referenced this pull request May 2, 2023
PR-URL: #47499
Refs: #47490
Refs: #47486
Reviewed-By: Marco Ippolito <[email protected]>
Reviewed-By: Robert Nagy <[email protected]>
danielleadams pushed a commit that referenced this pull request Jul 6, 2023
PR-URL: #47499
Refs: #47490
Refs: #47486
Reviewed-By: Marco Ippolito <[email protected]>
Reviewed-By: Robert Nagy <[email protected]>
MoLow added a commit to MoLow/node that referenced this pull request Jul 6, 2023
PR-URL: nodejs#47499
Refs: nodejs#47490
Refs: nodejs#47486
Reviewed-By: Marco Ippolito <[email protected]>
Reviewed-By: Robert Nagy <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. commit-queue-squash Add this label to instruct the Commit Queue to squash all the PR commits into the first one. lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run. notable-change PRs with changes that should be highlighted in changelogs. semver-minor PRs that contain new features and should be released in the next minor version.
Projects
None yet
Development

Successfully merging this pull request may close these issues.