Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safety of Mmap::as_slice ? #25

Open
SimonSapin opened this issue Nov 7, 2016 · 30 comments
Open

Safety of Mmap::as_slice ? #25

SimonSapin opened this issue Nov 7, 2016 · 30 comments

Comments

@SimonSapin
Copy link

The documentation for this method includes:

Unsafety

The caller must ensure that the file is not concurrently modified.

But since the filesystem is shared with other processes that might do anything, this seems very difficult to ensure. (I imagine that an application could be run in a container like Docker to give it a private filesystem?)

So what’s the worst that could happen? https://stackoverflow.com/questions/21286870/how-safe-are-memory-mapped-files-for-reading-input-files seems a bit hand-wavy but suggests: not much.

I’m considering using this method to read (hopefully more efficiently than with File::read) files that are usually not modified, but they might be modified for example when the system’s package manager updates them to a new version.

I do not mind if reading a byte at the same location twice gives different values, or if reading two locations give inconsistent values (because a write has happened in between the two reads). This might cause my program to unexpectedly return Err or (safely) panic, but that’s ok.

I do mind if this is Undefined Behavior of the sort that can cause anything to happen, including potentially being exploited for remote code execution or other fun stuff.

If only the former can happen, should this method really be unsafe?

@BurntSushi
Copy link
Collaborator

See also: #4

@SimonSapin
Copy link
Author

So is the conclusion that file-based mmap is basically impossible to use safely? (Unless you somehow control the entire filesystem.)

This does not match how common this pattern seems to be.

@BurntSushi
Copy link
Collaborator

@SimonSapin I feel similarly as you, but the argument in favor of unsafe is somewhat convincing. In particular, without unsafe one could have an &[u8] that can be mutated in safe code, which generally seems like a violation of what one would expect when given an &[u8].

@danburkert
Copy link
Owner

Yep, @BurntSushi is correct. I wouldn't expect something like a system package manager updating a package to be an issue, since I would expect it to write a new file instead of modifying an existing one, but obviously that's down to the implementation.

So is the conclusion that file-based mmap is basically impossible to use safely?

It's impossible for this library to guarantee that the mmap is being used correctly, but it's not impossible for a combination of the application developer and operator. File locks (even advisory), permissions, containers, etc. can all be used to make sure the invariants aren't broken.

@burdges
Copy link

burdges commented Nov 7, 2016

Apologies, but I'm confused. When would safe code mutate a &[u8] that some unknown function gave it? Wouldn't only a &mut [u8] have that problem? I understand that a function could be marked unsafe to push out the "unsafe boundary" that Niko's blog recent posts discuss in relation to future optimizations, but you'll never know if you're the only owner of a &[u8] given to you by another crate, so that does not seem to apply. What am I missing?

As an aside, an interesting way to address this for new rust code might be another crate wrapping this crate that provides a family of checksums on the file contents and updates them when some reference guard type goes out of scope. You'd want a checksum that was locally mailable in the data in O(1) time, so no cryptographic checksums like polay1305. Also, if the checksum were mailable in both the key and data, then you could mutate the key when mapping the file, so that the current file's checksum key would never exist on disk until you closed it, or maybe wrote it out as part of a commit operation and updated it in memory again. Any processes that want to share the file could share the in-memory key via another IPC channel. All this malleability means a weak checksum, but one could make up for that by making it larger.

@BurntSushi
Copy link
Collaborator

@burdges The &[u8] from a memory map is backed by a file, which means it can be mutated by modifying the file itself.

@burdges
Copy link

burdges commented Nov 7, 2016

I miss-read your statement up thread as saying that marking the function as unsafe changed the compiler's behavior in some more subtle way. In this case, the unsafe just reminds the programmer that providing a safe abstraction requires more work though. Ok

@BurntSushi
Copy link
Collaborator

@burdges Ah I see ya, definitely didn't mean that! Apologies.

@strega-nil
Copy link

Couldn't it return an &[Cell] in order to be safe?

@BurntSushi
Copy link
Collaborator

@ubsan Can you expand more on that suggestion? What is Cell?

@burdges
Copy link

burdges commented Dec 5, 2016

I suppose @ubsan means std::cell::Cell<u8> but to answer the question..

Cell<T> is not Sync. AtomicU8 is Sync, but not stable, and rather expensive. If you want mmap then you likely want a lighter weight courser grained locking mechanism anyways.

@strega-nil
Copy link

Seems reasonable.

@DoumanAsh
Copy link

Wouldn't simply opening File with write access to ensure that no other process could access it? (at least on Windows I think it would give you some guarantees)

@BurntSushi
Copy link
Collaborator

@DoumanAsh "at least on Windows" I think is the key part. :-)

@DoumanAsh
Copy link

Sadly i'm not expert in regard of Linux, but at least there is some capabilities to set lock through (fcntl)[http://man7.org/linux/man-pages/man2/fcntl.2.html]

So i suppose it all comes to OS capabilities.
fcntl can lock through Advisory record locking, but doesn't guarantee safety i think?

Then there is non-posix thing that would work only on Linux
See section Open file description locks (non-POSIX)
Which is stil advisory...

Considering that Mandatory locking is buggy according to above link, there is no way on Linux and therefore no way to guarantee safety on unix\linux platforms

@BurntSushi
Copy link
Collaborator

Right, advisory locking is opt-in. An application itself can use it to provide a form of internally consistent safety, but it can never prevent some other random process from mutating its &[u8].

@dtolnay
Copy link

dtolnay commented Jul 31, 2017

To summarize, are we saying that concurrently modifying a memmap'd file to which Rust holds a &[u8] or &mut [u8] is officially Undefined Behavior, but "probably" will not cause anything bad if you are just reading bytes out?

@strega-nil
Copy link

@dtolnay I mean, yeah? but if you're not just reading bytes out, it'll probably bite you at some point. It's the same issue many old C++ codebases are having with newer compilers that assume you're not doing multithreaded accesses without atomics; this would be equivalent to multithreaded accesses without atomics.

@danburkert
Copy link
Owner

One thing that worries me is that we're changing the API to have a safe deref to &[u8] on Mmap, but make it unsafe to create a Mmap from a file. @ubsan is this going to cause problems with the unsafe code guidelines, since it seems like it may allow the unsafe behavior (accessing shared mmap'd memory) to be far from the unsafe annotation (creating the mmap)?

@strega-nil
Copy link

@danburkert I mean, if you're not doing UB, you're fine ;)

@BurntSushi
Copy link
Collaborator

@ubsan Could you elaborate on that please?

@strega-nil
Copy link

strega-nil commented Aug 19, 2017

@BurntSushi unsafe doesn't affect whether something is UB - at least, I'm really working hard to make that happen. Therefore, the location of the keyword doesn't matter. (and anyways, the kind of UB we're talking about here will never be defined, as it's equivalent to unprotected threaded alias).

Basically, you need to use atomic accesses for every single access to a MAP_SHARED file, if you're going to be changing the backing file. It doesn't matter where the unsafe goes.

s/volatile/atomic, thanks talchas

@iqualfragile
Copy link

Would it be possible to open the file, i.e. have a file-handle, delete the file from the filesystem, check that only one file handle to that file exists by iterating through all open files and then return to the user?
That would probably cause problems if the file is not un-deleted before the program exits.

@danburkert
Copy link
Owner

@iqualfragile what specifically do you mean by un-deleted?

@iqualfragile
Copy link

well, you would have to restore the file to the file system when you close the mmap (on drop probably) otherwise the file is just gone. You got a file descriptor so you can re-add it to the file system.

@danburkert
Copy link
Owner

@iqualfragile what specifically do you mean by re-add it to the file system? I'm not aware of an API to do that (even OS/FS dependent), so I'm intrigued.

If there's no API and it requires writing the entire file back out you may as well not use a mmap at all.

@iqualfragile
Copy link

iqualfragile commented Jun 30, 2018

on linux you can access the list of file descriptors of a process in /proc/pid/fd/descriptor. you can then use ln -L to re-link the file (inode) to a name in the file system tree. doing so does not invoke a copy. i do not know how ln does it though.
so basically: have a file open in a program, optionally delete it, cd to /proc/pid/fd/, ls -lah to find which descriptor you want, ln -L it to where you want it, if you delete it you can restore it to the same name, otherwise you get two hardlinked files.
hope that makes it clear.
thinking about it this solution is probably not as unworkable as i thought when i first suggested it because an unclean shutdown of your program tends to leave an inconsistent state on disk anyway. so it might be ok/better if the file is gone instead.

@iqualfragile
Copy link

doing some further reading this is archived by using the rather new linkat() syscall, passing the AT_SYMLINK_FOLLOW flag.

@KodrAus
Copy link

KodrAus commented Jun 30, 2018

... an unclean shutdown of your program tends to leave an inconsistent state on disk anyway. so it might be ok/better if the file is gone instead.

Hmm, I don't think this an assumption that could really be baked in here. Inconsistent state doesn't necessarily mean irrecoverable state.

@iqualfragile
Copy link

true, but in that case you are building a database and are probably ok with the unsafe interface? Another even more specific option would be to snapshot/cp --reflink the file before deleting on filesystems that support it. Would effectively limit the full solution to work on linux using btrfs or zfs. (i think xfs is working on patches for cow and there are some patches to make ext do cow).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants