Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement bash (with globstar) style globbing #2522

Closed
wants to merge 10 commits into from
Closed

Implement bash (with globstar) style globbing #2522

wants to merge 10 commits into from

Conversation

hdgarrood
Copy link
Contributor

Fixes #784, #2030, resolves #713, subsumes #1343, #1344. Picks up where #1975 left off.

Implement a reduced form of GNU bash style globbing. The supported
features should account for 99% of use cases.

Still to do:

  • minimum cabal-version constraint (etc). This seems to be
    necessary because, for example, this commit changes the meaning of *.js,
    which previously would not have matched jquery.cookie.js, and now does.
  • should we add a limitation to disallow * and ** from appearing as the very last item of a glob pattern, to avoid including unwanted items? The documentation in this PR currently describes this limitation but I haven't implemented it yet.
  • test correct behaviour re cabal version constraints in all four cases
  • ensure that the new globs are actually being used in all of the cabal file fields that we expect them to be, that is, data-files, extra-source-files, extra-doc-files (I did this manually).

@hdgarrood
Copy link
Contributor Author

This is failing due to warnings and -Werror. One sec, I'll fix those.

@@ -163,6 +163,7 @@ library
Distribution.Compat.Exception
Distribution.Compat.ReadP
Distribution.Compiler
Distribution.Glob
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move under Distribution.Utils.Glob.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will do.

@23Skidoo
Copy link
Member

23Skidoo commented Apr 4, 2015

Some documentation and a changelog update for this would be nice.

@23Skidoo
Copy link
Member

23Skidoo commented Apr 4, 2015

Still to do: minimum cabal-version constraint.

Do you plan to implement it?

@hdgarrood
Copy link
Contributor Author

Ok, I'll do some documentation and a changelog update.

I was planning on implementing a cabal-version constraint too, yes. Regarding that, do you think it would be ok to simply refuse to build a project if:

  • it contains at least one 'real' glob, and
  • it does not contain cabal-version: >= 1.23 or higher?

Presumably this could lead to cabal files which worked fine before 1.23, and would have continued to work fine after 1.23, but would cause builds to fail upon upgrading Cabal to 1.23? Can we live with that? It's a bit ugly but I can't imagine any alternative.

Implement a reduced form of GNU bash style globbing. The supported
features should account for 99% of use cases.

Still to do: minimum cabal-version constraint. This seems to be
necessary because, for example, this commit changes the meaning of *.js,
which previously would not have matched jquery.cookie.js, and now does.
* Move Distribution.Glob to Distribution.Utils.Glob
* Add documentation for Distribution.Simple.Utils.matchDirFileGlob
@hdgarrood
Copy link
Contributor Author

Also, a couple of other things I've remembered:

  • In a previous issue, someone suggested having some kind of limitation about */** wildcards, which I forgot about. I can think of two ways of implementing it:
    • */** is not allowed to appear as the entire final directory component of a pattern. So Foo/* is not allowed but test/**/Test* is.
    • */** is not allowed to appear at the very end of the glob. So both of the above examples are disallowed. This would, for example, prevent accidental inclusion of test/TestFoo.hi, which I guess is a good idea.
  • The current implementation of parseDirFileGlob is quite naive, and possibly does more work than it needs to, by immediately calling getDirectoryContentsRecursive from the base directory. Do you think this might be a problem?

@hdgarrood
Copy link
Contributor Author

GitHub isn't displaying my most recent commits, for some reason. There should be 4 showing up now, rather than 2. See master...hdgarrood:bash-globbing instead if necessary.

@23Skidoo
Copy link
Member

23Skidoo commented Apr 5, 2015

Presumably this could lead to cabal files which worked fine before 1.23, and would have continued to work fine after 1.23, but would cause builds to fail upon upgrading Cabal to 1.23? Can we live with that?

Hmm, are you sure there's no way around it? This could be a blocker.

The current implementation of parseDirFileGlob is quite naive

Will look at it in more detail soon.

@hdgarrood
Copy link
Contributor Author

Re version constraints - the way I see it:

(supposing this goes into 1.23, which I am aware is perhaps a little premature ;)

Any package using the new globbing will have to use a minimum Cabal version constraint. If we don't do that, then bad things will happen if A authors a package with Cabal >= 1.23 and B tries to build it with Cabal < 1.23. New syntax would fail (I think), and * wildcards would pick up different groups of files.

Come to think of it, maybe we can get around this - if we're building a package with a Cabal file which does use globs, and which does not have a minimum Cabal version constraint of at least 1.23, we could get around the issue by issuing a notice or warning to the command line saying something like "Glob syntax changed in Cabal 1.23. You are currently using the old syntax. To silence this warning, do {something}. To use the new syntax, add a minimum Cabal version constraint >= 1.23".

@mietek
Copy link
Contributor

mietek commented Apr 5, 2015

With this change, do data-files, extra-source-files, and extra-doc-files accept the same globbing syntax?

Does this change fix #2030?

@hdgarrood
Copy link
Contributor Author

I'm pretty sure all of those fields should now accept the same syntax, yes (though I haven't been able to work out how to test that...)

I believe this fixes #2030 too.

@23Skidoo
Copy link
Member

23Skidoo commented Apr 5, 2015

Come to think of it, maybe we can get around this - if we're building a package with a Cabal file which does use globs, and which does not have a minimum Cabal version constraint of at least 1.23, we could get around the issue by issuing a notice or warning to the command line saying something like "Glob syntax changed in Cabal 1.23. You are currently using the old syntax. To silence this warning, do {something}. To use the new syntax, add a minimum Cabal version constraint >= 1.23".

What I think would be ideal:

  • If the package uses only pre-1.23 glob features and the semantics is the same as in <1.23 - business as usual.
  • If the package uses only pre-1.23 glob features, but there are files that wouldn't be picked up by Cabal < 1.23 (e.g. foo/*.js -> foo/jsquery.min.js) - print a warning about new glob semantics (what you propose).
  • If the package uses new glob features, but there is no cabal-version: >1.23 constraint - fatal error.
  • If the package uses new glob features, and there is an appropriate constraint - business as usual.

@@ -260,6 +264,7 @@ test-suite unit-tests
UnitTests.Distribution.Compat.ReadP
UnitTests.Distribution.Simple.Program.Internal
UnitTests.Distribution.Utils.NubList
UnitTests.Distribution.Glob
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should now be UnitTests.Distribution.Utils.Glob.

matches x (CharLiteral y) = x == y
matches x (Range start end) = start <= x && x <= end

-- | A safe version of 'tail'.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tailMay from the safe library. Since we can't add a dependency on safe, let's rename it to tailMay and move to Distribution.Simple.Utils.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah cool, didn't know that, thanks! Will do.

... and add to Distribution.Utils.Safe
@@ -0,0 +1,6 @@
module Distribution.Utils.Safe where
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Distribution.Simple.Utils already imports Distribution.Utils.Glob, to avoid circular imports, I added tailMay to this new module, and then also re-exported it from Distribution.Simple.Utils. Is that ok?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, absolutely.

@hdgarrood
Copy link
Contributor Author

If the package uses only pre-1.23 glob features, but there are files that wouldn't be picked up by Cabal < 1.23 (e.g. foo/*.js -> foo/jsquery.min.js) - print a warning about new glob semantics (what you propose).

Just checking I've understood you - in this case, which semantics should be used?

@23Skidoo
Copy link
Member

23Skidoo commented Apr 5, 2015

Just checking I've understood you - in this case, which semantics should be used?

New semantics is used in all cases, but a warning should be printed when it differs from the old.

@BardurArantsson
Copy link
Collaborator

Hi all,

I must say that like some of the commenters here, I've never understood the actual motivation for restricting globs. I mean, I get that it's usually a good thing to avoid excessive complexity, but if there's a reasonable test suite that particular objection fades away.

Protecting users from themselves is not a useful goal when those users are presumably expert users. If users are worried that e.g. password files (or the like) get included, they should be doing actual audits of their Continuous Integration builds and/or source distribution files (cabal sdist; check tarball for unexpected files), not relying on a general tool like Cabal to "protect them" (which it won't anyway since it doesn't actually check for password-like things).

Anyway, that was my 0.02€'s worth. :)

@ttuegel
Copy link
Member

ttuegel commented Apr 8, 2015

I think @mietek makes a very strong case for dropping globs altogether, in favor of allowing directories. I propose going even further. Let us deprecate data-files in favor of a new field data-dir which takes exactly one directory name; all the files and directories below this directory are included. Consider these advantages:

  1. The semantics literally could not be easier to understand. Perhaps we would add an exception that hidden files and directories not be included, but the semantics of the field remain simple to explain.
  2. The implementation is very simple.
  3. There are no potential backward-compatibility issues, except that new versions of Cabal may emit a warning when data-files is used.
  4. Package authors must strictly segregate data files from other files, virtually eliminating the possibility of mistakenly including non-data files.

I think the last point is the most important. Our language (Haskell) strictly regulates allowed behavior, forcing us to organize our programs so that it is difficult to make certain classes of error. Why do we accept less from our package manager?

@BardurArantsson
Copy link
Collaborator

"... which takes exactly one directory name..."

No, please don't. This is imposes more restrictions on anyone who's developing using a simple http server just serving their local files.

@mietek
Copy link
Contributor

mietek commented Apr 8, 2015

@BardurArantsson: Can you explain? I don’t see a problem with @ttuegel’s proposal.

@ttuegel
Copy link
Member

ttuegel commented Apr 8, 2015

No, please don't. This is imposes more restrictions on anyone who's developing using a simple http server just serving their local files.

How? The only reason to include data-files is so they will be packaged in the tarball and installed elsewhere on the system when you do cabal install. Neither of these are desirable features for a simple local-file HTTP server. Could you elaborate on your use case?

@BardurArantsson
Copy link
Collaborator

Maybe I misunderstood, but forcing a single directory for "include everything" (which is what I understood your suggestion to mean), would mean that I would have to go out of my way to force all the html, css, &c to live in a single directory. Which I may not want while I'm actually developing things. (I get that this is probably good practice in general, but it just seems like an arbitrary restriction for no good reason... other that an extreme aversion to complexity.)

@BardurArantsson
Copy link
Collaborator

(Sorry about the immediate follow-up.)

One of my use cases is actually a little bit outside the stereotypical web-like cases: I want to distribute a standard config file and some bits which are just incidental things that were easier to distribute as "raw" files instead of trying to embed them properly (as ByteStrings or such, through TH). See the .cabal file of the "hums" package: https://hackage.haskell.org/package/hums-0.7.0 )

@ttuegel
Copy link
Member

ttuegel commented Apr 8, 2015

@BardurArantsson Would it suffice for your use cases if we allowed specifying data-dir multiple times, such that the occurrences are additive? For example,

data-dir: html
data-dir: css
data-dir: img

to include those three directories.

An aside, I just discovered that we already have a data-dir field, whose purpose is to make the data-files field annoyingly difficult to comprehend. So, the new behavior would be conditional on some version constraint. Say, if cabal-version: >= 1.23 then data-dir includes all files recursively and it's an error to have a data-files field. If cabal-version: <1.23, then the old semantics hold.

The existing behavior is definitely too complicated. If a regular Cabal contributor and a long-time Haskell user doesn't understand the behavior of (what should be) a simple field, it's too complicated.

@mietek
Copy link
Contributor

mietek commented Apr 8, 2015

I think it’s reasonable to allow only a single data-dir in a Cabal package description, as it would correspond exactly to the directory available at run-time. If this forces the user to keep their source repository organised, then so be it.

@hdgarrood
Copy link
Contributor Author

I agree, I like @ttuegel's proposal a lot as well. If I was developing some kind of web thing, I wouldn't mind putting all of html, css, javascript as subdirectories of a single static directory. In fact, I'm pretty sure I would choose to do that, even for projects that weren't in Haskell.

@23Skidoo
Copy link
Member

23Skidoo commented Apr 9, 2015

Looking at @mietek's examples, it looks like the only glob features that could be useful are *, {} and **. The examples can be rewritten as follows:

  1. static/**/*.{css,svg,eot,ttf,woff,ico,png,jpg,html}
  2. Imports.hs, static/**/*.{svg,eot,ttf,woff,css,js}
  3. static/**/*.{css,js,htm}
  4. examples/shellmate8032, examples/*.{hs,html,css,png,ico,js}

Which seems to confirm my opinion that we only need a subset of the Bash pattern syntax.

The issue of data-files being confused with extra-source-files is interesting. I'm not sure that deprecating data-files in favour of data-dir will stop people from confusing them, though I like this proposal.

@BardurArantsson
Copy link
Collaborator

Would it suffice for your use cases if we allowed specifying data-dir multiple times, such that the occurrences are additive? For example,

Sorry for not getting back until now. Yes, this would be sufficient for my purposes, and I suspect most web-app-like things. (Though there may be issues if one is storing, say, high-quality SVGs together with rendered PNGs and only wants to distribute the PNGs.)

And, yes, I need multiple dirs.

(I still don't see why allowing full glob-syntax would be horribly bad, but whatever. I'm still vehemently opposed to glob-like syntax that doesn't adhere to anything found in an actual real shell. Principle of Least Surprise, people!)

@hdgarrood
Copy link
Contributor Author

Closing this, because a) we seem to be at an impasse, and b) I think specifying whole directories is probably better anyway.

@hdgarrood hdgarrood closed this Jan 11, 2016
@hdgarrood
Copy link
Contributor Author

I'll leave my code here, just in case anyone does want it.

@BardurArantsson
Copy link
Collaborator

Sad to see this die by bikeshedding :(. It's still a silly bug that we can't get reliable inclusion of data files using wildcards.

Didn't everyone more-or-less agree that multiple data-dirs would be sufficient?

@mietek
Copy link
Contributor

mietek commented Jul 2, 2016

No, @BardurArantsson. @ttuegel proposed a new data-dir field which would take a single directory name. @hdgarrood, @23Skidoo, and myself liked the proposal, while you didn’t.

quasicomputational added a commit to quasicomputational/cabal that referenced this pull request Apr 25, 2018
These are inspired by a plan described in a comment in haskell#2522, and only
implement a quite limited form of recursive matching: only a single **
wildcard is accepted, it must be the final directory, and, if a **
wildcard is present, the file name must include a wildcard.

Or-patterns are not implemented, for simplicity.

Closes haskell#3178, haskell#2030.
quasicomputational added a commit to quasicomputational/cabal that referenced this pull request Apr 25, 2018
These are inspired by a plan described in a comment in haskell#2522, and only
implement a quite limited form of recursive matching: only a single **
wildcard is accepted, it must be the final directory, and, if a **
wildcard is present, the file name must include a wildcard.

Or-patterns are not implemented, for simplicity.

Closes haskell#3178, haskell#2030.
quasicomputational added a commit to quasicomputational/cabal that referenced this pull request Apr 28, 2018
These are inspired by a plan described in a comment in haskell#2522, and only
implement a quite limited form of recursive matching: only a single **
wildcard is accepted, it must be the final directory, and, if a **
wildcard is present, the file name must include a wildcard.

Or-patterns are not implemented, for simplicity.

Closes haskell#3178, haskell#2030.
quasicomputational added a commit to quasicomputational/cabal that referenced this pull request Apr 29, 2018
These are inspired by a plan described in a comment in haskell#2522, and only
implement a quite limited form of recursive matching: only a single **
wildcard is accepted, it must be the final directory, and, if a **
wildcard is present, the file name must include a wildcard.

Or-patterns are not implemented, for simplicity.

Closes haskell#3178, haskell#2030.
quasicomputational added a commit to quasicomputational/cabal that referenced this pull request May 6, 2018
These are inspired by a plan described in a comment in haskell#2522, and only
implement a quite limited form of recursive matching: only a single **
wildcard is accepted, it must be the final directory, and, if a **
wildcard is present, the file name must include a wildcard.

Or-patterns are not implemented, for simplicity.

Closes haskell#3178, haskell#2030.
23Skidoo pushed a commit that referenced this pull request May 8, 2018
These are inspired by a plan described in a comment in #2522, and only
implement a quite limited form of recursive matching: only a single **
wildcard is accepted, it must be the final directory, and, if a **
wildcard is present, the file name must include a wildcard.

Or-patterns are not implemented, for simplicity.

Closes #3178, #2030.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants