Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Ambiguity in the File namespace API #14245

Open
straight-shoota opened this issue Jan 16, 2024 · 12 comments
Open

[RFC] Ambiguity in the File namespace API #14245

straight-shoota opened this issue Jan 16, 2024 · 12 comments

Comments

@straight-shoota
Copy link
Member

straight-shoota commented Jan 16, 2024

Motivated by #13849, I started looking at uses of File.file? and similar methods. That issue was about an inadequate use of that method in the compiler. Sadly, it doesn't look great on other call sites as well. The problem is systematic and asks for improvements on the API level.

Introduction

File.file? seems like a reasonable choice for checking if a path is a file. But it depends on the understanding of "file". This methods considers an (ordinary) file in contrast to other file types like directory, special device file, named pipe. This detail is often not terribly relevant. Most use cases shouldn't worry too much about the exact type.
But often doesn't mean always and rare usage of some of these types makes it easy to miss any shortcomings on them.

A typical intention is to verify that File.read or similar would work on the path. But that's not what File.file? does. It's focused on entry types in the file system. Named pipes or character devices work for File.read as well and their exclusion is often unreasonable and unintended.

File.exists? is confusing because File.exists?(path) suggests it's checking if a file exists. But this method checks if anything exists at the path. It doesn't need to be a file. It can be a directory.

It get's even more confusing. The following example seems legit and safe. But in reality, it doesn't make any sense:

if File.readable?(path)
  File.read(path)
end

Despite the similar name suggesting a relation, File.readable? has nothing to do with File.read.
File.readable? is typically true for a directory1. But File.read does not work on directories.

I've mentioned these shortcomings in my talk at CrystalConf 2023. This issue however is a more complete and organized collection on this topic.

Investigation

I looked through stdlib and compiler sources for call sites of some related methods in the File namespace.

Some of the findings from combing through stdlib and compiler sources

The most severe faults that I noticed, have already been fixed:

Some examples for lesser issues where there was lesser need to fix them directly:

when File.file?(command)

opts.stop if File.file?(arg)

These are similar to #13849 (same location) and should actually check if the path points to something that we can practically read from.

There are many more similar instances in the compiler that I'm skipping for brevity. They're usually related to loading source code or cached data from files. In some cases it might be fine to use File.file?, but in others the same limitations as above apply.

if ::File.file?(LOCALTIME) && ::File.readable?(LOCALTIME)

This goes the extra step of also checking File.readable?. But its still limited to ordinary files.

return source if File.exists?(path) && File.file?(path) && File.readable?(path)

This goes another step extra and over the top. File.exists? is redundant to File.file?.


There are plenty uses of these methods that are dramatically wrong, others not that much. For some of these methods, their semantics rarely match the apparent intention at the callsite. I barely found any instance where I'm confident that the method corresponds exactly to the intended purpose.

It seems very clear that there are many issues with utilizing these API methods in the standard library and compiler. It's not hard to extrapolate that this would apply to other code bases as well. We can't even get it right in the stdlib!

An easy solution would be if people just read the documentation. It clearly says what these methods do!
But the sheer scale of inapproriate usage clearly points to intrinsical flaws. It's hardly possible to use these methods correctly for the most common use cases

It's not that they're completely useless, they have their purpose. But those are specific niche use cases.
You won't need them unless your Crystal program explicitly deals with file system details, for example.

Analysis

I don't think any of these individual methods are wrong in some sense.
The underlying problem is an ambiguous mixture of different concepts in the same namespace.
Some methods are tightly coupled to the tiny details of the file system and do not match the scope of higher-level, usage-oriented APIs.

Different understandigns of the term "file" in methods from the File namespace:

  • File.file?: only an ordinary file (excluding named pipes and character devices)
  • File.read: ordinary files, named pipe and character device
  • File.exists?: any kind of entry in the file system, including directories

Having that all crowded together in the prominent top-level File namespace seems to be the cause for most misunderstandings.

Discussion

I believe we should separate methods with different scopes to express their differences more clearly.

  1. Lower-level methods using the narrower definitions (e.g. .readable?, .writable? and .executable?) could fit into File::Info or File::Type. Their use cases are rare and there's not much reason to have them in a high-level namespace.

  2. The File type is focused on the concept of a stream that can read and write. So the File class namespace should be reserved for the core methods following that definition.
    There could be a set of methods to check wether a read or write operation would principally work on a given path. They could be named File.can_read? and File.can_write? (.readable? and .writable? already have a different meaning and even if we removed them, we couldn't immediately replace them with a different method).

  3. I'm not sure what to do about the wider scope referring to any entry in the file system. There's no apparent target for that. Maybe we should consider a new FileSystem namspace? Path could be an option, but it has so far been understood in an abstract sense, so we might just be heading for another mixture of concerns.
    I think most methods of this scope are not even that bad to mix in the File namespace. Except for .exists? I've found little cause for confusion. This might need some more investigation, but perhaps we could leave them in the File namespace and just rename File.exists? to a less ambiguous name.

Footnotes

  1. The readable bit is typically set on directories because that enables listing its entries.

@apainintheneck
Copy link
Contributor

I was curious after reading this so I took a quick look at what other non-Ruby languages do for these types of file APIs.

Interestingly enough it seems like Python, Go and C++ lack a high-level file permissions API like Crystal has currently with File.readable?, File.writable? and File.executable?. Basically the idea is that if you really need file permissions info you can go the some sort of lower level API like stat. This seems similar to what Crystal already does with File::Info. Python and Go even seem to suggest just assuming that files have the permissions you expect and just handle the errors as they pop up.

The File.exists? method is interesting because it's not consistently implemented for different programming languages. Ruby, Crystal, C++ and Python all have it while Go and Node don't have it. In fact, Node used to have it but it got deprecated at some point (see docs).

Looking over the documentation for a few languages made me realize that the Crystal docs tend to not clarify the difference between file system objects and individual file types very well.


File.exists?

Ruby

Return true if the named file exists.

file_name can be an IO object.

“file exists” means that stat() or fstat() system call is successful.

Python

Return True if path refers to an existing path or an open file descriptor. Returns False for broken symbolic links. On some platforms, this function may return False if permission is not granted to execute os.stat() on the requested file, even if the path physically exists.

Crystal

Returns whether the file given by path exists.

Symbolic links are dereferenced, possibly recursively. Returns false if a symbolic link refers to a non-existent file.

File.file?

Ruby

Returns true if the named file exists and is a regular file.

file can be an IO object.

If the file argument is a symbolic link, it will resolve the symbolic link and use the file referenced by the link.

Crystal

Returns true if given path exists and is a file.

@straight-shoota
Copy link
Member Author

it seems like Python, Go and C++ lack a high-level file permissions API like Crystal has currently with File.readable?, File.writable? and File.executable?. Basically the idea is that if you really need file permissions info you can go the some sort of lower level API like stat. This seems similar to what Crystal already does with File::Info.

Yeah, I think that's quite reasonable considering that valid use cases for these methods are quite rare.
Note that stat does not have this level of information. It only has the basic file permissions (not ACL or correlation with user/group). Thus File::Info in Crystal (which is based on stat) currently does not entail this either.

Python and Go even seem to suggest just assuming that files have the permissions you expect and just handle the errors as they pop up.

That's probably a good choice in many use cases. Even if you do every possible check beforehand, File.read may still fail for various reasons. Maybe limitations in the file system driver. So the only way you can reliably know whether File.read will work is actually running it. Dealing with errors is quite easy, you just have to handle whatever IO::Error the method raises. There's usually no need to address different error conditions explicitly.

@dsisnero
Copy link

When doing things with actual files and directories, I like being able to build it up with Path

dir = Path['mydir']
file = dir / 'myfile'

I wish, like ruby Pathname we had the file? and directory? methods on Path and maybe even mkdir on dir Path's

Dir.mkdir dir unless dir.directory?
or
dir.mkdir_p unless dir.directory?

@straight-shoota
Copy link
Member Author

@dsisnero I think enhancements to Path are really not within the scope of this issue.
There has been a bit of a debate about whether Path should be aware of any actual file system information or stay purely abstract. It would probably be best to start a discussion about this on https://forum.crystal-lang.org
Thanks.

@ysbaddaden
Copy link
Contributor

@straight-shoota I think extending Path to become concrete is in scope of this issue.

For example Path.exists? wouldn't be ambiguous anymore. Other methods would make sense on Path (both as class and instance methods). We can also use the opportunity to lift some ambiguities (e.g. .file?).

@straight-shoota
Copy link
Member Author

straight-shoota commented Mar 12, 2024

This is certainly related but IMO it should still be a separate discussion. The outcome affects parts of this issue, of course. But it's not fundamental, it just offers one potential solution for one of the aspects (and it's the least important one). The more critical issues of this topic do not depend on the Path discussion.

@ysbaddaden
Copy link
Contributor

From the above analysis:

  1. I'd be fine dropping readable?, .writable? and .executable? altogether, but maybe someone would like them (e.g. to implement ls) so I guess we could move them to File::Info since permissions are part of the file metadata (and not the actual data).
  2. You said there aren't reliable ways to know whether we really can actually read, write or execute a file because of ACL and others, let's avoid this.
  3. IMO Path is the way forward, it's already not fully abstract anyway: Path.new is OS dependent, Path#expand supports the home directory, ...

@straight-shoota
Copy link
Member Author

straight-shoota commented Mar 12, 2024

You said there aren't reliable ways to know whether we really can actually read, write or execute a file because of ACL and others, let's avoid this.

There's no reliably way to know for sure. But there are simple ways to tell that you can't. And I think this is actually quite valuable for a number of use cases (I list some examples in the detailed investigation report).

It's better to do a basic sanity check before the read/write/exec than just perform it and then having to deal with basic errors. Compare the following examples for optionally reading from a file if it exists.

begin
  File.read(path)
rescue exc : FileNotFoundError
  # ignore error if file doesn't exist
end
if File.readable?(path) && !File.directory?(path)
  File.read(path)
end

Note: The examples don't do exactly the same thing and there are more error cases not covered for brevity. This fact also exemplifies that it's a complex topic that needs attention to get the semantics right.

I think the latter form is more ergonomical and efficient. However the condition (readable &&! directory) isn't very straightforward. So UX could be improved by providing direct predicate methods for this kind of test:

if File.can_read?(path)
  File.read(path)
end

UPDATE: Note that this form introduces a race condition. The file could be deleted between checking the condition and executing the actual read. So it might still be necessary to implement proper error handling.

@ysbaddaden
Copy link
Contributor

I'd suggest File.can?(permissions, path) where permissions is a flag enum.

@straight-shoota
Copy link
Member Author

Hm. I see how that could be useful. But I think it's not as easy to use and could be confusing (there's already File::Permissions enum, for example).
If you need more detailed information, it's probably better to employ the lower level File::Info API directly.
My idea with .can_read? etc. is to be easy to discover and convenient for the most common use cases.

@straight-shoota
Copy link
Member Author

straight-shoota commented Apr 13, 2024

An alternative or supplement to query methods such as .can_read? could be having alternatives for the methods that would be guarded by them, which handle these error cases internally and ignore them. This would result in a most trivial API and isn't succeptible to race conditions.

For example, File.read? would return nil if the file does not exist.

It would be the same concept as File.info and File.info?, File.delete and File.delete?.

An issue with this approach is that it's not possible to easily distinguish between a path not existing at all or being of a wrong kind (like a directory). But I'm not sure if that's actually that much relevant. I presume if you want to read from a specific file name, it's okay if it does not exist, but if it's something entirely different, that's probably a odd situation that needs attention.

New methods related to reading and writing: .each_line?, .open?, .read?, .read_lines?, .write?.

@crysbot
Copy link
Collaborator

crysbot commented Apr 23, 2024

This issue has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/add-file-info-methods-to-path/6781/1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants