Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore symlinks under LocalFileSystem root (#2174) #2207

Closed

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #2174
Closes #2206

Rationale for this change

The LocalFileSystem relies on canonicalizing filesystem paths to a URL in order to assign them a consistent key. This logic breaks down when encountering symlinks, as not only can files have multiple paths, but these paths may be outside the prefix of the LocalFileSystem itself. The simplest solution is to just not support them

What changes are included in this PR?

Explicitly ignore any symlinks encountered, which also allows dropping the walkdir dependency as we no longer need protection against filesystem loops caused by soft links. Hard link loops are impossible to handle, with most OSes preventing them.

Are there any user-facing changes?

Symlinks will no longer be followed by LocalFileSystem

@tustvold tustvold added the api-change Changes to the arrow API label Jul 28, 2022
@github-actions github-actions bot added the object-store Object Store Interface label Jul 28, 2022
@alamb
Copy link
Contributor

alamb commented Jul 28, 2022

I need to think more deeply about what not supporting symlinks would mean. I do feel like symlinks are used for many different things locally so simply ignoring them seems less than ideal

@tustvold
Copy link
Contributor Author

tustvold commented Jul 28, 2022

To be completely clear, LocalFileSystem has never properly supported them, all this change does is explicitly not support them, as opposed to the current state of play where they are only effectively not supported 😅

They will continue to work as before in the path to the LocalFileSystem root, we just ignore any under this

@alamb
Copy link
Contributor

alamb commented Jul 28, 2022

I don't understand what the current state of support for LocalFileSystem (as in what is not properly supported)? I could test this out myself I am just being lazy

@tustvold
Copy link
Contributor Author

tustvold commented Jul 28, 2022

#[tokio::test]
    #[cfg(target_family = "unix")]
    async fn test_symlink() {
        let root = TempDir::new().unwrap();
        let integration = LocalFileSystem::new_with_prefix(root.path()).unwrap();

        let subdir = root.path().join("a");
        std::fs::create_dir(&subdir).unwrap();
        let file = subdir.join("file.parquet");
        std::fs::write(&file, "test").unwrap();

        check_list(&integration, &["a/file.parquet"], None).await;

        // Out of tree symlink gets ignored
        let other = NamedTempFile::new().unwrap();
        std::os::unix::fs::symlink(other.path(), root.path().join("test.parquet"))
            .unwrap();
        check_list(&integration, &["a/file.parquet"], None).await;

        // In tree symlink gets resolved and deduplicated
        std::os::unix::fs::symlink(&subdir, root.path().join("b")).unwrap();
        check_list(&integration, &["a/file.parquet"], None).await;

        // In tree symlink gets resolved and canonicalised to actual path
        check_list(&integration, &["a/file.parquet"], Some(&Path::from("b"))).await;

        // Out of tree symlink that loops back is ignored
        let dir = TempDir::new().unwrap();
        let child = root.path().join("child");
        let link = child.join("link");

        std::fs::create_dir(child).unwrap();
        std::os::unix::fs::symlink(dir.path(), &link).unwrap();
        std::os::unix::fs::symlink(&subdir, dir.path().join("back")).unwrap();

        let loopback = root
            .path()
            .join("child")
            .join("link")
            .join("back")
            .join("file.parquet");

        assert!(loopback.exists());

        check_list(&integration, &[], Some(&Path::from("child"))).await;
        check_list(&integration, &[], Some(&Path::from("child/link"))).await;

        // Ignore broken symlink
        std::os::unix::fs::symlink(
            root.path().join("foo.parquet"),
            root.path().join("c"),
        )
        .unwrap();
        check_list(&integration, &["a/file.parquet"], None).await;
    }

To summarise:

  • Symlinks that "escape" the LocalFileSystem root are silently ignored, even if they link back into the tree
  • Symlinks that are within the LocalFileSystem root are resolved to paths, even if these then aren't prefixes of the search path
  • The current behavior is sufficiently counter-intuitive that I think simply ignoring soft links is a better UX

There may be other weirdness going on, I'm actually having a hard time fully understanding what the behaviour is... The above was entirely determined empirically...

@jccampagne
Copy link
Contributor

tustvold:
// In tree symlink gets resolved and deduplicated
// In tree symlink gets resolved and canonicalised to actual path
Symlinks that are within the LocalFileSystem root are resolved to paths, even if these then aren't prefixes of the search path

What's the rationale behind resolving symlinks to actual path, and deduplicating?

alamb: I need to think more deeply about what not supporting symlinks would mean. I do feel like symlinks are used for many different things locally so simply ignoring them seems less than ideal

A use case for symlinks could be a way to organise large data files (eg parquet): instead of moving or copying large datasets, one could organise the datasets by using links under different directories.

eg:

The directory /data/everything/ is a repository containing many (large) files:

/data/everything/a.parquet
…
/data/everything/z.parquet

One would create several subsets:

/data/client1/a.parquet -> /data/everything/a.parquet
/data/client1/b.parquet -> /data/everything/b.parquet


/data/client2/d.parquet -> /data/everything/d.parquet
/data/client2/e.parquet -> /data/everything/e.parquet
/data/client2/f.parquet -> /data/everything/f.parquet

/data/client3/x.parquet -> /data/everything/w.parquet
/data/client3/y.parquet -> /data/everything/w.parquet  # yes, same file, why not?

And we would create a Filestore for each case/client (pseudo code):

LocalFileSystem::new_with_prefix("/data/client1");
LocalFileSystem::new_with_prefix("/data/client2");
LocalFileSystem::new_with_prefix("/data/client3");

And they would see the files as unresolved links: /data/client*/*.parquet instead of /data/everything/*.parquet.

@tustvold
Copy link
Contributor Author

tustvold commented Jul 30, 2022

The rationale for resolving symlinks was that the intent of the crate, at least historically, was to provide object store semantics. LocalFilesystem would then map this to a filesystem but filesystem specific things like relative paths, symlinks, non-ASCII characters, globs, etc... did not need to be supported.

This sidesteps a whole host of gnarly nonsense, e.g. if I delete a file that is a symlink should it actually delete the linked file. If I delete the last file in a directory should it delete the directory, what if there is a symlink to that directory? What about if I perform two concurrent modification requests that resolve to the same underlying filesystem path, etc... It also helps paper over OS-specific quirks, most notably the absolute mess that is filesystem paths on Windows as we just punt to the file URI standard.

I'm definitely not saying the current approach is perfect, but hopefully that gives some background on how things ended up this way? The TLDR was this was the least terrible way I devised to do it, but I'm open to alternative suggestions 😅

@alamb
Copy link
Contributor

alamb commented Jul 31, 2022

I'm definitely not saying the current approach is perfect, but hopefully that gives some background on how things ended up this way?

Thank you @tustvold -- Your rationale and description makes sense to me -- as I don't fully understand the symlink semantics, what I hope to do is to play around with the code in this PR and better understand it. I may not have a chance for a few days however

@alamb
Copy link
Contributor

alamb commented Jul 31, 2022

Also, maybe we can change the title of this PR as "Ignore symlinks in LocalFileSystem" seems somewhat contradictory to your explanation that "They [symlinks] will continue to work as before in the path to the LocalFileSystem root, we just ignore any under this"

I think I may be getting hung up on the implications of the title that the code might not fully reflect

@tustvold tustvold changed the title Ignore symlinks in LocalFileSystem (#2174) Ignore symlinks under LocalFileSystem root (#2174) Jul 31, 2022
@alamb
Copy link
Contributor

alamb commented Aug 1, 2022

I ran a little experiment (code below) with some various symlinks and the TLDR is I don't see any difference in behavior with this PR comared to master

$ ls -l /tmp/object_store/
total 16
-rw-r--r--  1 alamb  wheel   4 Aug  1 10:47 file1.txt
-rw-r--r--  1 alamb  wheel   7 Aug  1 10:47 file2.txt
lrwxr-xr-x  1 alamb  wheel  27 Aug  1 10:48 file_ln.txt -> /tmp/object_store/file1.txt
lrwxr-xr-x  1 alamb  wheel  21 Aug  1 10:50 file_ln_outside.txt -> /tmp/outsize_root.txt

Using the master branch, my test program shows

Using object store root: /tmp/object_store
file2.txt, size: 7, 7
file1.txt, size: 4, 4

Using this PR's branch my test prgram shows:

Using object store root: /tmp/object_store
file1.txt, size: 4, 4
file2.txt, size: 7, 7

Test code:

//! Basic 'ls' client
use std::sync::Arc;

use futures::stream::{FuturesOrdered, StreamExt};
use object_store::{local::LocalFileSystem, path::Path, ObjectStore};

#[tokio::main]
async fn main() {
    // create an ObjectStore
    let object_store: Arc<dyn ObjectStore> = get_local_store();

    // list all objects in the store
    let path: Path = "/".try_into().unwrap();
    let list_stream = object_store
        .list(Some(&path))
        .await
        .expect("Error listing files");

    // List all files in the store
    list_stream
        .map(|meta| async {
            let meta = meta.expect("Error listing");

            // fetch the bytes from object store
            let stream = object_store
                .get(&meta.location)
                .await
                .unwrap()
                .into_stream();

            // Get the size size
            let measured_size = stream
                .map(|bytes| {
                    let bytes = bytes.unwrap();
                    bytes.len()
                })
                .collect::<Vec<usize>>()
                .await
                .into_iter()
                .sum::<usize>();

            (meta, measured_size)
        })
        .collect::<FuturesOrdered<_>>()
        .await
        .collect::<Vec<_>>()
        .await
        .into_iter()
        .for_each(|(meta, measured_size)| {
            println!("{}, size: {}, {}", meta.location, meta.size, measured_size);
        });
}


fn get_local_store() -> Arc<dyn ObjectStore> {
    let root = "/tmp/object_store";
    println!("Using object store root: {}", root);
    let local_fs =
        LocalFileSystem::new_with_prefix(root)
        .expect("Error creating local file system");

    Arc::new(local_fs)
}

@alamb
Copy link
Contributor

alamb commented Aug 1, 2022

Perhaps I am doing something wrong 🤔

@tustvold
Copy link
Contributor Author

tustvold commented Aug 1, 2022

Yes, that is expected, as determined above:

  • Symlinks outside the tree such as "file_ln_outside.txt" are silently ignored
  • Symlinks within the tree such as "file_ln.txt" are deduplicated

In order to see a difference you need to construct a symlink setup that doesn't meet either of these properties.

For example.

$ tree -a /tmp/object_store/
/tmp/object_store/
├── a.txt
└── foo
    └── a.txt -> ../a.txt

With this setup if you list without a prefix you will get ["a.txt"], if you list with a prefix of "foo" you will also get back ["a.txt"] on master, but nothing on this PR (or at least you shouldn't).

That said there is a bug I've just realised, which I'll fix up now, but the behaviour you describe above is "expected". It's incredibly counter-intuitive, but that is part of why I want to fix this 😆

@tustvold tustvold marked this pull request as draft August 1, 2022 15:57
@tustvold
Copy link
Contributor Author

tustvold commented Aug 1, 2022

I'm going to take a stab at properly supporting symlinks and see where it leads me

@tustvold
Copy link
Contributor Author

tustvold commented Aug 1, 2022

Closing as I think #2269 is better, thank you all for pushing me to do this properly 😆

@tustvold tustvold closed this Aug 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API object-store Object Store Interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

object_store: Create explicit test for symlinks Test local::tests::test_list_root fails on main on macos
3 participants