Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor shell_escape_winsomely() and add escaping function for CMD.EXE syntax #34111

Closed
wants to merge 7 commits into from

Conversation

mgkuhn
Copy link
Contributor

@mgkuhn mgkuhn commented Dec 15, 2019

Refactored shell_escape_winsomely() into two separate functions escape_microsoft_c_args() and escape_raw_string(), such that the latter becomes separately available (needed for #29580) and to make it easier to understand.
As requested by @vtjnash in #33465.
Also added Windows CMD.EXE escaping function shell_escape_wincmd() as requested by @vtjnash to be separated out from #30614 into base/shell.jl (for SSHManager to be able to prepare Windows command lines)

@mgkuhn

This comment has been minimized.

@mgkuhn mgkuhn changed the title Shellescape Refactor shell_escape_winsomely() and add escaping function for CMD.EXE syntax Dec 15, 2019
@vchuravy
Copy link
Member

regex.jl is loaded after shell.jl, see

julia/base/Base.jl

Lines 177 to 178 in a6bf74a

include("shell.jl")
include("regex.jl")

Implementation of the raw-string escaping convention. Can also be
used to escape command-line arguments for Windows C/C++/Julia
applications, which use the same \" escaping convention as
Julia non-standard string literals.
Using escape_raw_string() make this function much easier to
understand. Name changed, as the function is not related to any shell.
else
write(io, ' ') # separator
end
if isempty(arg) || occursin(r"[ \t\"]", arg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if isempty(arg) || occursin(r"[ \t\"]", arg)
if isempty(arg) || any(c -> c in (' ', '\t', '"'), arg)

(faster)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends on the length. On my computer, if length(arg)<27 then there are about 50 ns setup penalty associated with regex use:

julia> using BenchmarkTools
julia> arg = 'a'^1; @btime occursin(r"[ \t\"]", $arg) ; @btime any(c -> c in (' ', '\t', '"'), $arg)
  56.275 ns (0 allocations: 0 bytes)
  7.988 ns (0 allocations: 0 bytes)

For length(arg)==27 there is no difference:

julia> arg = 'a'^27; @btime occursin(r"[ \t\"]", $arg) ; @btime any(c -> c in (' ', '\t', '"'), $arg)
  88.455 ns (0 allocations: 0 bytes)
  89.344 ns (0 allocations: 0 bytes)

For length(arg)>27 regex is faster, asymptotically twice as fast:

julia> arg = 'a'^1000; @btime occursin(r"[ \t\"]", $arg) ; @btime any(c -> c in (' ', '\t', '"'), $arg)
  1.380 μs (0 allocations: 0 bytes)
  2.673 μs (0 allocations: 0 bytes)
false

So overall, I still prefer using a regular expression here (and I suspect they might be beneficially used in some of the other shell escaping functions as well).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it more common to have arg < 27 than long arguments with arg>27 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are talking a maximum of ~50 ns benchmark overhead here on my 8-year-old PC (the time in which a bit travels 10 metres on a cable), which is many orders of magnitude less than the overhead of invoking a new process.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either is fine with me.

base/shell.jl Outdated Show resolved Hide resolved
base/shell.jl Outdated
if c == '"'
quoted = !quoted
else
if !quoted && c in ( '(', ')', '!', '^', '<', '>', '&', '|' )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if !quoted && c in ( '(', ')', '!', '^', '<', '>', '&', '|' )
if !quoted && c in ( '%', '(', ')', '!', '^', '<', '>', '&', '|' )

(actually, ( is not special, as that stack overflow link points out, but I think it'd just confused folks not to include it anyways)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remain confused about the exact meta character nature of ( and ), so prefer to keep both escaped for now.

base/shell.jl Outdated
of quotation marks on the command line.

The percent sign (`%`) is not escaped, therefore shell variable
references (like `%USER%`) will still be substituted by `cmd.exe`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not right: this is what shell_escape_wincmd exists to prevent.

Suggested change
references (like `%USER%`) will still be substituted by `cmd.exe`.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not so easy! The problem is that CMD.EXE performs variable substitution at the very beginning, before processing the ^ and " escaping characters. See the following experiment to demonstrate this:

C:\Users\mgk25>echo %USER%            
dc\mgk25                     
                             
C:\Users\mgk25>echo ^%USER% 
dc\mgk25                     
                             
C:\Users\mgk25>echo ^%USER^% 
%USER%                       
                             
C:\Users\mgk25>set USER^^=foo

C:\Users\mgk25>echo ^%USER^%  
foo                          

So if you try to escape % with ^%, you may in fact just be adding a ^ to the variable name that will be substituted. Yes, I know its crazy, but Windows always was an acquired taste ...

I believe it is impossible for shell_escape_wincmd to escape % effectively. I can reformulate the documentation to clarify that passing through % was not our choice, but was done to clarify that escaping it would not have been effective anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, wow. Yeah, I completely missed that. (I doubt I'm alone too, since it means 8.10 in http://daviddeley.com/autohotkey/parameters/parameters.htm#WINCRULES is similarly wrong). It's very unexpected to realize that % is not a cmd shell metacharacter, so ^% is no different from say ^A. It's instead a half separate pass of the parser.

Following the someone more detailed guide at https://stackoverflow.com/questions/4094699/how-does-the-windows-command-interpreter-cmd-exe-parse-scripts/7970912#7970912, I realized there's a workaround for rules that is valid for your experiment above. That is, an almost-working escape sequence for % is %^\n, as can be seen in the below example:

julia> run(pipeline(`cmd /Q /K set PROMPT=\?`, stdin=IOBuffer("echo %^\nOS%^\n\n\n")))

?More? More? More? %OS%


?Process(`cmd /Q /K set 'PROMPT=?'`, ProcessExited(0))

Note that passing this to /C fails though because of rule 1.05: "the remainder of the line from the onward" is dropped (ignored). Note that we are gaining an extra newline on the output too, since we're running afoul of the quoting rules for \n (it's ^\n\n) to handle the trailing %—there would need to be an exception to the rule for when % is at the end of the line already.

Also, perhaps we should mention that the user is advised to normally pass the S flag as such: /S /C " $the $command $line " to avoid losing their final " mark in $line contents under very rare circumstances.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is starting to get rather esoteric. I actually want to pass through % unescaped, because my use case is building a command line to be sent with ssh to another Windows server, where it can be perfectly useful to have variables such as %APPDATA% substituted remotely (as I don't have local access to them). Also, in the ssh use case, I am not calling CMD.EXE, so I can't pass any flags to it, the remote sshd is doing that.

If someone really needs a variant of this function that also goes to such great lengths to escape %, then I would be happy to draft a separate function shell_escape_wincmd2 that has that property. However, the algorithm used will be completely different, because in order to escape every % with your hack, I can no longer pass through pairs of " unescaped, as they too might contain a %. Escaping every % means also escaping every ", and I'd rather not do the latter as it makes all the most common cases look very odd.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's really abysmal behavior/design by CMD since it means you can't easily have one script call another (say, on another machine). I guess we could also attempt to hack it with ENV["PCTESC"]="%"; replace(cmd, "%" => "%PCTESC%"), but that seems pretty brittle too.

base/shell.jl Outdated Show resolved Hide resolved
base/shell.jl Outdated Show resolved Hide resolved
base/shell.jl Show resolved Hide resolved
characters `()!^<>&|` processed by the Windows `cmd.exe` shell: it
places a `^` in front of any metacharacter that follows an even number
of quotation marks on the command line.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The result is safe to pass as an argument to a command call being processed by `CMD.exe /c` and will be received verbatim by the target application (or this function will fail with an ArgumentError).
This function may be useful in concert with the `windows_verbatim` flag to [`Cmd`](@ref) when constructing process pipelines.
!warning
The argument parsing done by CMD when calling batch files (either inside `.bat` files or as arguments to them) is not compatible with the output of this function.
!important
Due to a peculiar behavior of the CMD parser/interpreter, each command after a literal `|` character
(indicating a command pipeline) must have `shell_escape_wincmd` applied twice since it will be parsed twice by CMD. For example:
```
to_print = "All for 1 & 1 for all!"
run(Cmd(Cmd(["cmd /c \"break | echo \$(shell_escape_wincmd(shell_escape_wincmd(to_print)))"]), windows_verbatim=true))
```

(adopting some of the advisory text from #33474):

julia/base/shell.jl

Lines 324 to 345 in 09c1d61

shell_escape_CMDly(arg::AbstractString) -> String
The unexported `shell_escape_CMDly` function takes a string and escapes any special characters
in such a way that it is safe to pass it as an argument to some `CMD.exe`. This may be useful
in concert with the `windows_verbatim` flag to [`Cmd`](@ref) when constructing process
pipelines.
See also [`shell_escape_PWSHly`](@ref).
# Example
```jldoctest
julia> println(shell_escape_CMDly("\"A B\\\" & C"))
^"A B\\^" ^& C
!important
Due to a peculiar behavior of the CMD, each command after a literal `|` character
(indicating a command pipeline) must have `shell_escape_CMDly` applied twice. For example:
```
to_print = "All for 1 & 1 for all!"
run(Cmd(Cmd(["cmd /c \"break | echo \$(shell_escape_CMDly(shell_escape_CMDly(to_print)))"]), windows_verbatim=true))
```
"""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I don't quite understand yet that suggested addition and I am not really familiar yet with all the behind-the-scenes processing that may go on during run(Cmd(Cmd(["cmd /c \"... on a Windows machine. My own use case and expertise is preparing command lines that get sent via ssh to a Windows server running OpenSSH sshd (e.g., in Distributed), where no libuv processing of command-line arguments passed to Windows processes is involved. I couldn't reproduce yet myself the effect referred to in this suggestion, and therefore would be more comfortable if we left this to a separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the behind-the-scenes processing

There is none. That's the point of using that syntax and flag: to disable all helpers and get the raw string without libuv processing or such.

Though I would perhaps further clarify here that it may be best passed as CMD.exe /S /C " prog args " (with the '/S' and surrounding " pair).

* ensure that only an even number of double quotes is passed through
  unescaped, such that the quote flag in CMD.EXE remains unset afterwards

* clarify in documentation why no attempt to escape `%`

* operates now on bytes rather than Unicode characters (to avoid
  UTF-8 decoding overhead)

* @ is now escaped at start of string

* more optimizations and test cases
@mgkuhn
Copy link
Contributor Author

mgkuhn commented Dec 22, 2019

@vtjnash I just committed a rewrite of shell_escape_wincmd() that now passes through unescaped double-quotes only in pairs, to ensure that the quote flag in CMD.EXE remains cleared afterwards. This is essential when using shell_escape_wincmd() multiple times in a line to build up a more complex command line.

base/shell.jl Outdated Show resolved Hide resolved
base/shell.jl Outdated Show resolved Hide resolved
base/shell.jl Outdated Show resolved Hide resolved
mgkuhn added a commit to mgkuhn/julia that referenced this pull request Dec 23, 2019
No longer assume that AbstractString uses some ASCII compatible encoding
(such as UTF-8), but call the UTF-8 decoder for each character
(even though this causes 30-60% runtime overhead compared to the
previous version).
mgkuhn added a commit to mgkuhn/julia that referenced this pull request Jan 6, 2020
@mgkuhn
Copy link
Contributor Author

mgkuhn commented Jan 6, 2020

Any chance this could still go into 1.4, such that the old shell_escape_winsomely() doesn't end up in a release?

write(io, SubString(s,i,j))
i = j
else
if c in ('"', '(', ')', '!', '^', '<', '>', '&', '|')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that cmd /? emits this list, but that I don't know where this list comes from:

The special characters that require quotes are:
     <space>
     &()[]{}^=;!'+,`~

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't spotted that list before, but it also seems wrong, at least for my use case of command invocation. For example: typing into the command line julia -e "print(\"hi\")">b or julia -e "print(\">\")"quite clearly demonstrates that any > in a CMD.EXE command-line argument needs to be escaped to not trigger stdout redirect by CMD.EXE. But the > character is not listed in that last line of cmd /?.

Reading the text before, it may well be meant in the context of the “completion” function, i.e. it may be a list of characters that the authors of the completion function worried about, and that function may work differently in different contexts. That may not be the same list of characters that we need to escape in command invocation, i.e. when sending strings such as set var=value&&pushd directory&&julia -e "print(\"...\")". When you ask for filename completion after having typed a filename that starts with >, you may want to redirect rather than have the > escaped.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but note that the list is intentionally incomplete with respect to our usage since \/:*?<>| (plus newline) can't appear in the result of the completion, so any of those might also be candidates for this list. The question then is where some of the others come from??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Jan Erik / Dave Benham description of the CMD.EXE parser does say that under certain circumstances in Phase 7 if the quote flag is off, then when testing if a command token is an internal command, it “break[s] the command token before the first occurrence of + / [ ] <space> <tab> , ; or =”, which probably accounts for the remaining characters in that list. I'm not claiming to fully understand yet all of this, but I doubt it is relevant to our use cases. These characters seem just used as token delimiters, not as special characters that trigger substitutions of their own in regular command-line arguments.

[Note that my original approach (which did not factor Microsoft C ARGV quoting and CMD.EXE quoting into separate functions) would have been able to use quotes more aggressively (i.e., quote something for the C library to unquote because CMD.EXE might interpret it as a meta character, not because the C library might), and thereby guarantee far more easily that none of these contexts are ever reached. But I think we are fine. Can you construct any use case where my proposed current implementation is not already doing the right thing?]

@mgkuhn
Copy link
Contributor Author

mgkuhn commented Nov 8, 2020

Following the cherry-pick merge of escape_raw_string() in #35309, a rebased version of the remaining content of this PR is now at #38352.

@mgkuhn mgkuhn closed this Nov 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants