-
-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement path validations #1740
Conversation
29a6760
to
4b1c6d0
Compare
Per my proposal in ipfs#1710: - Paths must be valid UTF-8 per RFC 3629. - Paths may not contain ASCII/Unicode C0 control characters (U+0000-U+001F). - Paths may not contain ASCII DEL (U+007F). - Paths are delimited by `/` (U+002F), and therefore path segments may not contain it. - Path segments may contain up to 255 Unicode codepoints. Total path length remains unbounded. - Path segments may not be empty, so that `foo//bar` can mean `foo/bar`, as in POSIX. - Path segments must not be `.` and `..`, so that these can mean what they do in POSIX. Paths may contain any sequence of Unicode codepoints that are not otherwise prohibited. This includes many things that could prove problematic; see path/validation_test.go +121 for some examples. License: MIT
4b1c6d0
to
b8b0870
Compare
|
||
func TestPathCleaning(t *testing.T) { | ||
cases := map[string]string{ | ||
// .. should strip the preceding path segment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not in favor of special casing the names of links like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same logic as POSIX pathname resolution and relative URI resolution.
User-specified objects named .
and ..
would be un-addressable in most contexts already, and even if they were addressable in some contexts, it would be very confusing if they had a different meaning than in POSIX and in URIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should be pushed to ipfs/go-ipld?
—
Sent from Mailbox
On Mon, Sep 21, 2015 at 6:16 PM, Will Glynn [email protected]
wrote:
@@ -28,3 +28,27 @@ func TestPathParsing(t *testing.T) {
}
}
}
+
+func TestPathCleaning(t *testing.T) {
- cases := map[string]string{
// .. should strip the preceding path segment
This is the same logic as POSIX pathname resolution and relative URI resolution.
Reply to this email directly or view it on GitHub:
User-specified objects named.
and..
would be un-addressable in most contexts already, and even if they were addressable in some contexts, it would be very confusing if they had a different meaning than in POSIX and in URIs.
https://github.com/ipfs/go-ipfs/pull/1740/files#r40046047
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see similar code being useful in go-ipld, but I think github.com/ipfs/go-ipfs/path
needs to resolve .
and ..
references on parse.
This is required for URIs – even absolute URIs – so web servers including Go's net/http
perform dot-related normalization automatically. This means that go-ipfs
will generally not see paths including .
and ..
when referenced via a URI, whether http:
or fs:
or file:
, rendering these paths equivalent for most users already. POSIX expects .
and ..
to have specific meanings which can be satisfied using this logic. Rob Pike's Lexical File Names in Plan 9 or Getting Dot-Dot Right is also applicable to IPFS, describing detailed rationale for treating Plan 9's non-hierarchical paths in this manner, stating in part:
The ability to construct union directories and other intricate naming structures introduces some thorny problems: as with symbolic links, the name space is no longer hierarchical, files and directories can have multiple names, and the meaning of
..
, the parent directory, can be ambiguous.The meaning of
..
is straightforward if the directory is in a locally hierarchical part of the name space, but if we ask what..
should identify when the current directory is a mount point or union directory or multiply symlinked spot (which we will henceforth call just a mount point, for brevity), there is no obvious answer. Name spaces have been part of Plan 9 from the beginning, but the definition of..
has changed several times as we grappled with this issue.
…
Frustrated by this situation, and eager to have better-defined names for some of the applications described later in this paper, we recently proposed the following definition for..
:
The parent of a directoryX
,X/..
, is the same directory that would obtain if we instead accessed the directory named by stripping away the last path name element ofX
.
Why should string /ipfs/base58hash/foo/../bar
not be immediately resolved to Path /ipfs/base58hash/bar
? Are we entertaining the idea that it could have any other interpretation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed @willglynn -- and agreed with plan9 folks:
The parent of a directory
X
,X/..
, is the same directory that would obtain if we instead accessed the directory named by stripping away the last path name element ofX
.
So this basically means that ..
is relative to the access path, correct? (this makes perfect sense to me)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@whyrusleeping what reservations do you have re .
and ..
matching unix/posix and the web expectations?
return err | ||
} | ||
} | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe remove this commented out stuff, as it is part of ParsePath
?
@willglynn what about these:
wonder if there's some plan9 code for this that we can leverage. i imagine they ran through the same problems. (i emailed them btw. rsc still strongly agrees with our UTF-8 convictions.) |
I think excluding C0 (U+0000-U+001F, U+007F) is critical because of the volume of (mostly C) code that treats UTF-8 as if it were ASCII. Excluding C1 (U+0080-U+009F) seems less critical. They're Unicode codepoints that most things permit without exploding. (See UTF-EBCDIC for the major exception.) There's a case for excluding them – they are control characters, and control characters probably don't belong in paths – but each thing we exclude is another rule, and… well, rules get heavy. I made this list and pushed early to prompt discussion with executable code. They're all things that maybe shouldn't be permitted, but that would be permitted by the rules in the accompanying commit message. Combining marks tend to be paired, but they are legal codepoints in isolation. I think we should permit that, however weird it might be, mostly because rules to prevent that case seem like they'd have to be too smart for our own good. Whitespace-related rules are a slippery slope. Prohibiting leading and trailing whitespace seems reasonable, but then why not interior whitespace too? Should multiple spaces be collapsed? Does this apply to all visually-blank characters, or just space? Which codepoints are considered whitespace, and is that context-dependent? (After all, the zero-width joiner exists because it actually does make a difference in certain languages.) It might be possible to formulate precise rules that accomplish a certain goal relating to whitespace, but let's start by defining that goal. The zero-width joiner is fun because Bidirectional text is a can of worms. I favor permitting bidi-related codepoints, largely because anything that understands them already has to deal with loads of complexity, much of which is described in annex 9. There are many other characters that affect text flow besides the one in the test case, and copy/pasting fragments of bidi text already causes most of these problems, so I don't think IPFS needs to be smart here. Private use characters are legal for interchange, and they tend not to be prohibited by data formats. They don't mean anything, but… use them if you like. That seems like a reasonable position. Prohibiting them would allow downstream applications to use private use characters for internal purposes, which would also be a reasonable position, but this would require another rule. Non-characters maybe should be excluded. There's no rule excluding them because I wasn't sure it's necessary, and keeping the rules short seems valuable. However, some of them used to be illegal for interchange, and some of them can be confusing when mapped to UTF-16, so this is probably a bigger deal than the C1 control codes causing problems in UTF-EBCDIC. If we're going to exclude one more class of things, I'd nominate excluding non-characters first. |
Yeah, i'd vote for excluding them still.
No comment-- do not understand all the implications well enough.
I think plenty of things do this though so it's not rare/unexpected.
Oooooof. no idea here. What i do know is that paths represent critical information to people and being able to recognize URLs matters. Imagine UTF-8 domain names where the security depends on recognition of a name... This may be fraught, but i'd love to see how others have tackled the problem first.
I don't think this is tenable, and worry about the implications. I don't think it's tenable because paths will continue to represent security implications. Today, name recognition is still responsible for stopping sophisticated phishing attacks.
No comment-- do not understand all the implications well enough.
I think it's fine to use them.
Sounds good to me. let's exclude them. I think our best bet is to look at what plan9 did. I'm sure they had something very good and sane. The source is available here: https://github.com/wangeguo/plan9 but i haven't found the path parsing/validating. |
IRIs are defined by RFC 3987 and addresses many of the concerns in this problem space. IRIs require normalization and specifies special handling for bidi text. This processing is more complex than the Unicode awareness in either HFS+ and NTFS, but might be workable in this context. IDNs have spoofing problems despite careful attention being paid to that topic. Current references are RFC 5890, 5891, 5892, 5893. Also, just like HFS+ had to rewrite names with a Another reference that might be useful is Unicode Technical Report 36, Unicode Security Considerations. More generally, there's tremendous conflict between "try not to get in the way" and "try to keep users from getting harmed by names that look similar but are different under the hood". Given the focus on content-addressability and general acceptance of names that aren't useful except by following links or copy/pasting, I think trying not to get in the way is the better tradeoff. |
I don't think we're close at all to security not mattering in visually-confirmed {paths, names, logos}. even when a PKI is employed, still have to visually recognize {paths, names, logos}. 😞 |
Hmm. IPFS paths begin with a multihash, which is both base58 (thus shrugging off all Unicode concerns) and difficult to spoof (i.e. to create a multihash that appears similar to an existing one). An individual publisher has full control and authority to publish anything, malicious or benign, within the For the case of I don't understand the threat you have in mind. Are you suggesting that the entire IPFS namespace needs the same kind of protection as domain names, even in user-controlled (i.e. attacker-controlled) parts of the hierarchy? |
👍 to will's points above.
yep!
In this case I'm worried about names like
where there's some application you're accessing, and it gives users On Sunday, October 4, 2015, Will Glynn [email protected] wrote:
|
Perhaps the path-related protections should be made pluggable, then. We agree that Still, I don't see that this concern belongs in the global namespace or in the definition of an IPFS path. |
fair point |
closing due to inactivity, please reopen as needed. |
I'm unclear on what needs to happen to get this (or something like it) merged. |
Is @willglynn 's proposal the current link name standard of IPFS, now, or has something changed later? It is a quit important information to have during the construction of a project is based on IPFS. |
Per my proposal in #1710:
(
U+0000
-U+001F
).U+007F
)./
(U+002F), and therefore path segments may notcontain it.
length remains unbounded.
foo//bar
can meanfoo/bar
, asin POSIX.
.
and..
, so that these can mean what theydo in POSIX.
Paths may contain any sequence of Unicode codepoints that are not otherwise prohibited. This includes many things that could prove problematic; see path/validation_test.go +121 for some examples.