-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor shell_escape_winsomely() and add escaping function for CMD.EXE syntax #34111
Conversation
This comment has been minimized.
This comment has been minimized.
Lines 177 to 178 in a6bf74a
|
Implementation of the raw-string escaping convention. Can also be used to escape command-line arguments for Windows C/C++/Julia applications, which use the same \" escaping convention as Julia non-standard string literals.
Using escape_raw_string() make this function much easier to understand. Name changed, as the function is not related to any shell.
else | ||
write(io, ' ') # separator | ||
end | ||
if isempty(arg) || occursin(r"[ \t\"]", arg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if isempty(arg) || occursin(r"[ \t\"]", arg) | |
if isempty(arg) || any(c -> c in (' ', '\t', '"'), arg) |
(faster)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends on the length. On my computer, if length(arg)<27
then there are about 50 ns setup penalty associated with regex use:
julia> using BenchmarkTools
julia> arg = 'a'^1; @btime occursin(r"[ \t\"]", $arg) ; @btime any(c -> c in (' ', '\t', '"'), $arg)
56.275 ns (0 allocations: 0 bytes)
7.988 ns (0 allocations: 0 bytes)
For length(arg)==27
there is no difference:
julia> arg = 'a'^27; @btime occursin(r"[ \t\"]", $arg) ; @btime any(c -> c in (' ', '\t', '"'), $arg)
88.455 ns (0 allocations: 0 bytes)
89.344 ns (0 allocations: 0 bytes)
For length(arg)>27
regex is faster, asymptotically twice as fast:
julia> arg = 'a'^1000; @btime occursin(r"[ \t\"]", $arg) ; @btime any(c -> c in (' ', '\t', '"'), $arg)
1.380 μs (0 allocations: 0 bytes)
2.673 μs (0 allocations: 0 bytes)
false
So overall, I still prefer using a regular expression here (and I suspect they might be beneficially used in some of the other shell escaping functions as well).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it more common to have arg < 27
than long arguments with arg>27
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are talking a maximum of ~50 ns benchmark overhead here on my 8-year-old PC (the time in which a bit travels 10 metres on a cable), which is many orders of magnitude less than the overhead of invoking a new process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either is fine with me.
base/shell.jl
Outdated
if c == '"' | ||
quoted = !quoted | ||
else | ||
if !quoted && c in ( '(', ')', '!', '^', '<', '>', '&', '|' ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if !quoted && c in ( '(', ')', '!', '^', '<', '>', '&', '|' ) | |
if !quoted && c in ( '%', '(', ')', '!', '^', '<', '>', '&', '|' ) |
(actually, (
is not special, as that stack overflow link points out, but I think it'd just confused folks not to include it anyways)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remain confused about the exact meta character nature of (
and )
, so prefer to keep both escaped for now.
base/shell.jl
Outdated
of quotation marks on the command line. | ||
|
||
The percent sign (`%`) is not escaped, therefore shell variable | ||
references (like `%USER%`) will still be substituted by `cmd.exe`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not right: this is what shell_escape_wincmd
exists to prevent.
references (like `%USER%`) will still be substituted by `cmd.exe`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not so easy! The problem is that CMD.EXE performs variable substitution at the very beginning, before processing the ^ and " escaping characters. See the following experiment to demonstrate this:
C:\Users\mgk25>echo %USER%
dc\mgk25
C:\Users\mgk25>echo ^%USER%
dc\mgk25
C:\Users\mgk25>echo ^%USER^%
%USER%
C:\Users\mgk25>set USER^^=foo
C:\Users\mgk25>echo ^%USER^%
foo
So if you try to escape % with ^%, you may in fact just be adding a ^ to the variable name that will be substituted. Yes, I know its crazy, but Windows always was an acquired taste ...
I believe it is impossible for shell_escape_wincmd
to escape % effectively. I can reformulate the documentation to clarify that passing through % was not our choice, but was done to clarify that escaping it would not have been effective anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, wow. Yeah, I completely missed that. (I doubt I'm alone too, since it means 8.10 in http://daviddeley.com/autohotkey/parameters/parameters.htm#WINCRULES is similarly wrong). It's very unexpected to realize that %
is not a cmd shell metacharacter, so ^%
is no different from say ^A
. It's instead a half separate pass of the parser.
Following the someone more detailed guide at https://stackoverflow.com/questions/4094699/how-does-the-windows-command-interpreter-cmd-exe-parse-scripts/7970912#7970912, I realized there's a workaround for rules that is valid for your experiment above. That is, an almost-working escape sequence for %
is %^\n
, as can be seen in the below example:
julia> run(pipeline(`cmd /Q /K set PROMPT=\?`, stdin=IOBuffer("echo %^\nOS%^\n\n\n")))
?More? More? More? %OS%
?Process(`cmd /Q /K set 'PROMPT=?'`, ProcessExited(0))
Note that passing this to /C
fails though because of rule 1.05: "the remainder of the line from the onward" is dropped (ignored). Note that we are gaining an extra newline on the output too, since we're running afoul of the quoting rules for \n
(it's ^\n\n
) to handle the trailing %
—there would need to be an exception to the rule for when %
is at the end of the line already.
Also, perhaps we should mention that the user is advised to normally pass the S flag as such: /S /C " $the $command $line "
to avoid losing their final "
mark in $line
contents under very rare circumstances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is starting to get rather esoteric. I actually want to pass through % unescaped, because my use case is building a command line to be sent with ssh to another Windows server, where it can be perfectly useful to have variables such as %APPDATA% substituted remotely (as I don't have local access to them). Also, in the ssh use case, I am not calling CMD.EXE, so I can't pass any flags to it, the remote sshd is doing that.
If someone really needs a variant of this function that also goes to such great lengths to escape %
, then I would be happy to draft a separate function shell_escape_wincmd2
that has that property. However, the algorithm used will be completely different, because in order to escape every %
with your hack, I can no longer pass through pairs of "
unescaped, as they too might contain a %
. Escaping every %
means also escaping every "
, and I'd rather not do the latter as it makes all the most common cases look very odd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's really abysmal behavior/design by CMD since it means you can't easily have one script call another (say, on another machine). I guess we could also attempt to hack it with ENV["PCTESC"]="%"; replace(cmd, "%" => "%PCTESC%")
, but that seems pretty brittle too.
characters `()!^<>&|` processed by the Windows `cmd.exe` shell: it | ||
places a `^` in front of any metacharacter that follows an even number | ||
of quotation marks on the command line. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The result is safe to pass as an argument to a command call being processed by `CMD.exe /c` and will be received verbatim by the target application (or this function will fail with an ArgumentError). | |
This function may be useful in concert with the `windows_verbatim` flag to [`Cmd`](@ref) when constructing process pipelines. | |
!warning | |
The argument parsing done by CMD when calling batch files (either inside `.bat` files or as arguments to them) is not compatible with the output of this function. | |
!important | |
Due to a peculiar behavior of the CMD parser/interpreter, each command after a literal `|` character | |
(indicating a command pipeline) must have `shell_escape_wincmd` applied twice since it will be parsed twice by CMD. For example: | |
``` | |
to_print = "All for 1 & 1 for all!" | |
run(Cmd(Cmd(["cmd /c \"break | echo \$(shell_escape_wincmd(shell_escape_wincmd(to_print)))"]), windows_verbatim=true)) | |
``` |
(adopting some of the advisory text from #33474):
Lines 324 to 345 in 09c1d61
shell_escape_CMDly(arg::AbstractString) -> String | |
The unexported `shell_escape_CMDly` function takes a string and escapes any special characters | |
in such a way that it is safe to pass it as an argument to some `CMD.exe`. This may be useful | |
in concert with the `windows_verbatim` flag to [`Cmd`](@ref) when constructing process | |
pipelines. | |
See also [`shell_escape_PWSHly`](@ref). | |
# Example | |
```jldoctest | |
julia> println(shell_escape_CMDly("\"A B\\\" & C")) | |
^"A B\\^" ^& C | |
!important | |
Due to a peculiar behavior of the CMD, each command after a literal `|` character | |
(indicating a command pipeline) must have `shell_escape_CMDly` applied twice. For example: | |
``` | |
to_print = "All for 1 & 1 for all!" | |
run(Cmd(Cmd(["cmd /c \"break | echo \$(shell_escape_CMDly(shell_escape_CMDly(to_print)))"]), windows_verbatim=true)) | |
``` | |
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid I don't quite understand yet that suggested addition and I am not really familiar yet with all the behind-the-scenes processing that may go on during run(Cmd(Cmd(["cmd /c \"...
on a Windows machine. My own use case and expertise is preparing command lines that get sent via ssh to a Windows server running OpenSSH sshd (e.g., in Distributed), where no libuv processing of command-line arguments passed to Windows processes is involved. I couldn't reproduce yet myself the effect referred to in this suggestion, and therefore would be more comfortable if we left this to a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all the behind-the-scenes processing
There is none. That's the point of using that syntax and flag: to disable all helpers and get the raw string without libuv processing or such.
Though I would perhaps further clarify here that it may be best passed as CMD.exe /S /C " prog args "
(with the '/S' and surrounding " pair).
* ensure that only an even number of double quotes is passed through unescaped, such that the quote flag in CMD.EXE remains unset afterwards * clarify in documentation why no attempt to escape `%` * operates now on bytes rather than Unicode characters (to avoid UTF-8 decoding overhead) * @ is now escaped at start of string * more optimizations and test cases
@vtjnash I just committed a rewrite of shell_escape_wincmd() that now passes through unescaped double-quotes only in pairs, to ensure that the quote flag in CMD.EXE remains cleared afterwards. This is essential when using shell_escape_wincmd() multiple times in a line to build up a more complex command line. |
…osoft_c_args from PR JuliaLang#34111
No longer assume that AbstractString uses some ASCII compatible encoding (such as UTF-8), but call the UTF-8 decoder for each character (even though this causes 30-60% runtime overhead compared to the previous version).
…osoft_c_args from PR JuliaLang#34111
Any chance this could still go into 1.4, such that the old |
write(io, SubString(s,i,j)) | ||
i = j | ||
else | ||
if c in ('"', '(', ')', '!', '^', '<', '>', '&', '|') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that cmd /?
emits this list, but that I don't know where this list comes from:
The special characters that require quotes are:
<space>
&()[]{}^=;!'+,`~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't spotted that list before, but it also seems wrong, at least for my use case of command invocation. For example: typing into the command line julia -e "print(\"hi\")">b
or julia -e "print(\">\")"
quite clearly demonstrates that any >
in a CMD.EXE
command-line argument needs to be escaped to not trigger stdout redirect by CMD.EXE
. But the >
character is not listed in that last line of cmd /?
.
Reading the text before, it may well be meant in the context of the “completion” function, i.e. it may be a list of characters that the authors of the completion function worried about, and that function may work differently in different contexts. That may not be the same list of characters that we need to escape in command invocation, i.e. when sending strings such as set var=value&&pushd directory&&julia -e "print(\"...\")"
. When you ask for filename completion after having typed a filename that starts with >
, you may want to redirect rather than have the >
escaped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but note that the list is intentionally incomplete with respect to our usage since \/:*?<>|
(plus newline) can't appear in the result of the completion, so any of those might also be candidates for this list. The question then is where some of the others come from??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Jan Erik / Dave Benham description of the CMD.EXE parser does say that under certain circumstances in Phase 7 if the quote flag is off, then when testing if a command token is an internal command, it “break[s] the command token before the first occurrence of +
/
[
]
<space>
<tab>
,
;
or =
”, which probably accounts for the remaining characters in that list. I'm not claiming to fully understand yet all of this, but I doubt it is relevant to our use cases. These characters seem just used as token delimiters, not as special characters that trigger substitutions of their own in regular command-line arguments.
[Note that my original approach (which did not factor Microsoft C ARGV quoting and CMD.EXE quoting into separate functions) would have been able to use quotes more aggressively (i.e., quote something for the C library to unquote because CMD.EXE might interpret it as a meta character, not because the C library might), and thereby guarantee far more easily that none of these contexts are ever reached. But I think we are fine. Can you construct any use case where my proposed current implementation is not already doing the right thing?]
Refactored shell_escape_winsomely() into two separate functions escape_microsoft_c_args() and escape_raw_string(), such that the latter becomes separately available (needed for #29580) and to make it easier to understand.
As requested by @vtjnash in #33465.
Also added Windows
CMD.EXE
escaping function shell_escape_wincmd() as requested by @vtjnash to be separated out from #30614 into base/shell.jl (for SSHManager to be able to prepare Windows command lines)