Ignore symlinks under LocalFileSystem root (#2174) #2207
I need to think more deeply about what not supporting symlinks would mean. I do feel like symlinks are used for many different things locally, so simply ignoring them seems less than ideal.
To be completely clear, LocalFileSystem has never properly supported them. All this change does is explicitly not support them, as opposed to the current state of play where they are only effectively not supported 😅 They will continue to work as before in the path to the LocalFileSystem root; we just ignore any under it.
I don't understand the current state of symlink support in LocalFileSystem (as in, what is not properly supported?). I could test this out myself, I am just being lazy.
To summarise:
There may be other weirdness going on, I'm actually having a hard time fully understanding what the behaviour is... The above was entirely determined empirically...
What's the rationale behind resolving symlinks to actual path, and deduplicating?
A use case for symlinks could be organising large data files (e.g. parquet): instead of moving or copying large datasets, one could organise them by placing links under different directories. E.g.: The directory
One would create several subsets:
And we would create a Filestore for each case/client (pseudo code):
And they would see the files as unresolved links:
The rationale for resolving symlinks was that the intent of the crate, at least historically, was to provide object store semantics. LocalFileSystem would then map this to a filesystem, but filesystem-specific things like relative paths, symlinks, non-ASCII characters, globs, etc. did not need to be supported. This sidesteps a whole host of gnarly nonsense. For example: if I delete a file that is a symlink, should it actually delete the linked file? If I delete the last file in a directory, should it delete the directory, and what if there is a symlink to that directory? What if I perform two concurrent modification requests that resolve to the same underlying filesystem path? It also helps paper over OS-specific quirks, most notably the absolute mess that is filesystem paths on Windows, as we just punt to the file URI standard. I'm definitely not saying the current approach is perfect, but hopefully that gives some background on how things ended up this way? The TLDR is that this was the least terrible way I devised to do it, but I'm open to alternative suggestions 😅
Thank you @tustvold -- Your rationale and description make sense to me -- as I don't fully understand the symlink semantics, what I hope to do is play around with the code in this PR and better understand it. I may not have a chance for a few days, however.
Also, maybe we can change the title of this PR, as "Ignore symlinks in LocalFileSystem" seems somewhat contradictory to your explanation that "They [symlinks] will continue to work as before in the path to the LocalFileSystem root, we just ignore any under this". I think I may be getting hung up on implications of the title that the code might not fully reflect.
I ran a little experiment (code below) with various symlinks, and the TLDR is I don't see any difference in behavior with this PR compared to master.
$ ls -l /tmp/object_store/
total 16
-rw-r--r-- 1 alamb wheel 4 Aug 1 10:47 file1.txt
-rw-r--r-- 1 alamb wheel 7 Aug 1 10:47 file2.txt
lrwxr-xr-x 1 alamb wheel 27 Aug 1 10:48 file_ln.txt -> /tmp/object_store/file1.txt
lrwxr-xr-x 1 alamb wheel 21 Aug 1 10:50 file_ln_outside.txt -> /tmp/outsize_root.txt
Using the
Using object store root: /tmp/object_store
file2.txt, size: 7, 7
file1.txt, size: 4, 4
Using this PR's branch my test program shows:
Test code:
//! Basic 'ls' client
use std::sync::Arc;

use futures::stream::{FuturesOrdered, StreamExt};
use object_store::{local::LocalFileSystem, path::Path, ObjectStore};

#[tokio::main]
async fn main() {
    // create an ObjectStore
    let object_store: Arc<dyn ObjectStore> = get_local_store();

    // list all objects in the store
    let path: Path = "/".try_into().unwrap();
    let list_stream = object_store
        .list(Some(&path))
        .await
        .expect("Error listing files");

    // For each listed object, fetch it and print its reported and measured size
    list_stream
        .map(|meta| async {
            let meta = meta.expect("Error listing");
            // fetch the bytes from object store
            let stream = object_store
                .get(&meta.location)
                .await
                .unwrap()
                .into_stream();
            // Measure the size by summing the length of each chunk
            let measured_size = stream
                .map(|bytes| {
                    let bytes = bytes.unwrap();
                    bytes.len()
                })
                .collect::<Vec<usize>>()
                .await
                .into_iter()
                .sum::<usize>();
            (meta, measured_size)
        })
        .collect::<FuturesOrdered<_>>()
        .await
        .collect::<Vec<_>>()
        .await
        .into_iter()
        .for_each(|(meta, measured_size)| {
            println!("{}, size: {}, {}", meta.location, meta.size, measured_size);
        });
}

/// Create a LocalFileSystem store rooted at /tmp/object_store
fn get_local_store() -> Arc<dyn ObjectStore> {
    let root = "/tmp/object_store";
    println!("Using object store root: {}", root);
    let local_fs =
        LocalFileSystem::new_with_prefix(root)
            .expect("Error creating local file system");
    Arc::new(local_fs)
}
Perhaps I am doing something wrong 🤔
Yes, that is expected, as determined above:
In order to see a difference you need to construct a symlink setup that doesn't meet either of these properties. For example.
With this setup, if you list without a prefix you will get
That said, there is a bug I've just realised, which I'll fix up now, but the behaviour you describe above is "expected". It's incredibly counter-intuitive, but that is part of why I want to fix this 😆
I'm going to take a stab at properly supporting symlinks and see where it leads me.
Closing as I think #2269 is better, thank you all for pushing me to do this properly 😆
Which issue does this PR close?
Closes #2174
Closes #2206
Rationale for this change
The LocalFileSystem relies on canonicalizing filesystem paths to a URL in order to assign them a consistent key. This logic breaks down when encountering symlinks: not only can a file have multiple paths, but these paths may resolve outside the prefix of the LocalFileSystem itself. The simplest solution is to just not support them.
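To illustrate the problem, here is a minimal standard-library sketch. It is not the crate's actual code: `resolve_under_root` is a hypothetical helper invented for this example. It canonicalizes a path and reports whether the result still lies under the canonicalized root.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of object_store): canonicalize `path`
/// and report whether the resolved location is still under `root`.
fn resolve_under_root(root: &Path, path: &Path) -> io::Result<(PathBuf, bool)> {
    // `fs::canonicalize` resolves every symlink along the path.
    let root = fs::canonicalize(root)?;
    let resolved = fs::canonicalize(path)?;
    // `Path::starts_with` compares whole components, not string prefixes.
    let inside = resolved.starts_with(&root);
    Ok((resolved, inside))
}
```

With a layout like the one discussed in the conversation, a symlink to a sibling file resolves to the same canonical path as its target (two names, one key), while a symlink pointing outside the root resolves to a path for which the returned flag is `false`.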
What changes are included in this PR?
Explicitly ignore any symlinks encountered, which also allows dropping the walkdir dependency as we no longer need protection against filesystem loops caused by soft links. Hard link loops are impossible to handle, with most OSes preventing them.
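As a rough sketch of what "ignore any symlinks encountered" can look like with only the standard library (this is not the PR's implementation, and `list_files_ignoring_symlinks` is a hypothetical name): a plain directory walk that skips symlink entries never descends through a soft link, so it cannot loop through one.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: recursively collect regular files under `dir`,
/// skipping every symlink (to files or to directories).
fn list_files_ignoring_symlinks(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    let mut pending = vec![dir.to_path_buf()];
    while let Some(current) = pending.pop() {
        for entry in fs::read_dir(&current)? {
            let entry = entry?;
            // `DirEntry::file_type` does not follow symlinks, so a link is
            // reported as a symlink rather than as its target's type.
            let file_type = entry.file_type()?;
            if file_type.is_symlink() {
                continue; // ignored: never listed, never descended into
            } else if file_type.is_dir() {
                pending.push(entry.path());
            } else if file_type.is_file() {
                files.push(entry.path());
            }
        }
    }
    // Sort for a deterministic order; `read_dir` order is unspecified.
    files.sort();
    Ok(files)
}
```

Because symlinked directories are skipped rather than followed, a link that points back at an ancestor directory is simply left out of the listing.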
Are there any user-facing changes?
Symlinks will no longer be followed by LocalFileSystem