Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/shfmt: change the language default from "bash" to "auto" #622

Closed
mvdan opened this issue Oct 28, 2020 · 8 comments · Fixed by #796
Closed

cmd/shfmt: change the language default from "bash" to "auto" #622

mvdan opened this issue Oct 28, 2020 · 8 comments · Fixed by #796

Comments

@mvdan
Copy link
Owner

mvdan commented Oct 28, 2020

That is, auto-detection of what language the input shell should be in. This is because, right now, running something like shfmt -l -w . on a large repository will simply use the default of -ln=bash for all files. One can use -ln=posix to change that, but then all files are parsed as POSIX.

It's not terribly common to mix different shell languages in a single repository, but it happens. For example, imagine a repository which has a few simple POSIX shell scripts for extreme portability, some complex Bash scripts for development, and a few Bats files for unit testing other shell scripts. Right now, this would require at least three shfmt calls, and manually hard-coding which files are in which language.

We can do better. The files themselves often tell us what language they are in via the filename extension and the shebang.

I propose the following algorithm to auto-detect the language. When a language is selected, the following steps aren't considered.

  1. If the filename ends with .bash, .mksh, or .bats, we use that language.
  2. If the content begins with a valid shell shebang like #!/bin/sh or #!/usr/bin/env bash, we use that language. Here, sh means POSIX shell.
  3. We fall back to the current default, assuming Bash.

Step 1 does not assume .sh means POSIX, because Bash scripts with that extension are too common. If a file is meant to be POSIX shell, give it a sh shebang, which will be matched by step 2.

Step 3 allows us to not break backwards compatibility. If we don't know what language the input is in, we'll just keep the current conservative fallback that is Bash.

To my mind, the only downside to this change is that it adds a bit of complexity; we now need to explain what -ln=auto means. But I think that should be doable in a few lines in the added man page, just like we did in three list items above.

--

This is a reboot of #429, which I actually forgot about. Fun how I came up with almost exactly the same idea again. The reason I think this is worth considering again is two-fold:

  • The default of -ln=bash is still too stiff, in my opinion. Support for Bats is a clear example; a project having both Bash and Bats files is entirely normal.
  • EditorConfig does kind of mitigate this problem, but having to enter [*.bats]\nshell_variant = "bats" is silly. Plus, since EditorConfig is entirely filename-based, it does not cover Step 2 from this proposal.

CC @jansorg @kolyshkin @lassik @kaey since you all contributed to related issues. If we reach consensus on this, I'd implement this for 3.3.0, which should come out in early 2021.

@jansorg
Copy link
Contributor

jansorg commented Oct 28, 2020

Sounds good to me!
You might want to consider to make this the same or similar to ShellCheck, which has its own detection logic. That would help to be able to validate and format files without in a project using different ways to markup the file type.
I haven't looked up its detection logic in the code, but it's additionally using # shellcheck shell= directives, https://github.com/koalaman/shellcheck/wiki/Directive#shell.

@mvdan
Copy link
Owner Author

mvdan commented Oct 28, 2020

It seems like their algorithm is defined in koalaman/shellcheck@a4b9cec:

  1. ShellCheck directive
  2. Shebang
  3. File extension

I don't think I want to teach shfmt about shellcheck directives. It's an explicit design decision of our parser to not treat any comments as special directives. I think shebangs are the only universal exception here, and already provide what we need.

I also slightly disagree with the order. If a foo.bash file contains #!/bin/sh, it should be treated as Bash, not as POSIX. You can argue that such a file would be contradictory and I would agree, but filename extension should trump shebangs because it's the first thing a user will see.

I agree that it would be nicer if all tools would agree on a "shell language detection" standard, but given that ShellCheck went for a custom algorithm based on directives, I'm afraid that's already an impossibility. shfmt already has a -f flag to find shell files though, so I could add a similar mechanism to have shfmt tell you what language it thinks a script is written in, and then you could apply that to the rest of your shell tools/logic.

@jansorg
Copy link
Contributor

jansorg commented Oct 28, 2020

Thanks, that makes sense.
Regarding the autodetection of foo.bash with #!/bin/sh: This controls runtime behaviour (Bash will run in posix mode when executed as sh). ShellCheck needs to know about this, but not shfmt. Therefore the detection logic above seems right to me for shfmt.

@kaey
Copy link
Contributor

kaey commented Oct 28, 2020

Consider following repo:

cmd/prog1
cmd/prog2
lib/funcs.sh

progs are executable scripts and contain #!/bin/ash shebang.
funcs.sh is sources by both progs and has no shebang.

In this example autodetection will detect progs as posix(probably?) and funcs as bash.

@mvdan
Copy link
Owner Author

mvdan commented Oct 28, 2020

The current proposal wouldn't understand posix-y shells like ash or dash, so your prog files would default to bash. We could extend the rules in the future, but for now I want to keep them simple.

Similarly, since funcs.sh has no shebang and the .sh extension is inconclusive (see the explanation right after the steps in the original post), it would also parse as bash.

In general, if you have a library POSIX shell file that's meant to be sourced but not executed directly, I think your best bet is to add an explicit POSIX shebang at the top of it, like #!/bin/sh. The script can remain without the executable permission bit to signify that it's not meant to be a program to be run directly.

If you don't like having shebangs in library files, then your other option is EditorConfig, like:

[lib/*]
shell_variant = "posix"

# or for all .sh files, to override the "bash" fallback
[*.sh]
shell_variant = "posix"

@kaey
Copy link
Contributor

kaey commented Oct 29, 2020

The current proposal wouldn't understand posix-y shells like ash or dash, so your prog files would default to bash.

That's another inconsistency

In general, if you have a library POSIX shell file that's meant to be sourced but not executed directly, I think your best bet is to add an explicit POSIX shebang at the top of it, like #!/bin/sh

Nobody's going to do that.

If you don't like having shebangs in library files, then your other option is EditorConfig

Current rules are pretty simple - parse everything as bash unless EditorConfig says otherwise.

This proposal lets group of programmers targeting bash avoid writing EditorConfig, but group targeting posix shell (and other implementations) will have to deal with some inconsistencies.

In the end it's a question about whether one group is large enough that it's worth creating more rules (more code, more docs) for solving their problem, while adding some inconsistencies for the other group.

@mvdan
Copy link
Owner Author

mvdan commented Oct 29, 2020

That's another inconsistency

Like I said, we could teach the tool to understand other posix-y shells in the future.

Nobody's going to do that.

Nobody is forced to do that, either. The bash fallback is still there if you don't care. And you can always be explicit via a flag or via EditorConfig if you want.

Current rules are pretty simple - parse everything as bash unless EditorConfig says otherwise.

The rules are simple, but IMO the default is too rigid. It's fine for the parser to have extremely simple rules, where you always have to say what language you want if it's not Bash. But shfmt is meant to be more user-friendly, which is why EditorConfig was added in the first place.

group targeting posix shell (and other implementations) will have to deal with some inconsistencies.

I disagree. If anything, almost nothing changes for them. If they weren't specifying posix, they'll just keep on getting the same bash fallback. If they're specifying posix in some way, they'll get posix. And they always have the option to explicitly state that all of their code is posix.

@mvdan mvdan removed the help wanted label Feb 26, 2021
mvdan added a commit that referenced this issue Jan 22, 2022
The man page describes the logic in detail.
In short, the default is no longer just "assume bash",
as it will try to detect the language from the filename and shebang,
and only fall back to "bash" if those fail.

Users who specify a language themselves, either via -ln or editorconfig,
should see no change in behavior whatsoever.
Those relying on the default should generally see an improvement,
as they might no longer have issues with directories mixing languages.

One potential regression is cases like:

	$ cat foo.sh
	foo=(bar baz) # actually bash

That is, bash scripts which use "sh" extensions or shebangs.
"sh" stands for POSIX shell, so the new logic will parse them as such,
whereas the old logic would parse them as bash.
Still, that seems like an easy fix on the part of the user:
they can either change the extension or shebang to be clear,
or they can explicitly set up the option to "bash".

Note that the added "auto" logic applies to all kinds of inputs:
explicit files, directories to be walked, and standard input.

Fixes #622.
@mvdan
Copy link
Owner Author

mvdan commented Jan 22, 2022

It's taken me a while to come back to this - I had a half-finished local branch from over a year ago, and I hadn't realised it would take me a few extra hours to iron out the last few details with tests.

Please give #796 a try and let me know how it works :) Reviews are also very much welcome.

mvdan added a commit that referenced this issue Jan 27, 2022
The man page describes the logic in detail.
In short, the default is no longer just "assume bash",
as it will try to detect the language from the filename and shebang,
and only fall back to "bash" if those fail.

Users who specify a language themselves, either via -ln or editorconfig,
should see no change in behavior whatsoever.
Those relying on the default should generally see an improvement,
as they might no longer have issues with directories mixing languages.

One potential regression is cases like:

	$ cat foo.sh
	foo=(bar baz) # actually bash

That is, bash scripts which use "sh" extensions or shebangs.
"sh" stands for POSIX shell, so the new logic will parse them as such,
whereas the old logic would parse them as bash.
Still, that seems like an easy fix on the part of the user:
they can either change the extension or shebang to be clear,
or they can explicitly set up the option to "bash".

Note that the added "auto" logic applies to all kinds of inputs:
explicit files, directories to be walked, and standard input.

Fixes #622.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants