Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC-112: Path Normalization #112

Merged
merged 8 commits into from
Mar 8, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
79 changes: 79 additions & 0 deletions docs/rfcs/0112-path-normalization.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
- Proposal Name: `path-normalization`
- Start Date: 2022-03-08
- RFC PR: [datafuselabs/opendal#112](https://github.com/datafuselabs/opendal/pull/112)
- Tracking Issue: [datafuselabs/opendal#112](https://github.com/datafuselabs/opendal/issues/112)

# Summary

Implement path normalization to enhance user experience.

# Motivation

OpenDAL's current path behavior makes users confused:

- [operator.object("/admin/data/") error](https://github.com/datafuselabs/opendal/issues/107)
- [Read /admin/data//ontime_200.csv return empty](https://github.com/datafuselabs/opendal/issues/109)

They are different bugs that reflect the exact root cause: the path is not well normalized.

On local fs, we can read the same path with different path: `abc/def/../def`, `abc/def`, `abc//def`, `abc/./def`.

There is no magic here: our stdlib does the dirty job. For example:

- [std::path::PathBuf::canonicalize](https://doc.rust-lang.org/std/path/struct.PathBuf.html#method.canonicalize): Returns the canonical, absolute form of the path with all intermediate components normalized and symbolic links resolved.
- [std::path::PathBuf::components](https://doc.rust-lang.org/std/path/struct.PathBuf.html#method.components): Produces an iterator over the Components of the path. When parsing the path, there is a small amount of normalization...

But for s3 alike storage system, there's no such helpers: `abc/def/../def`, `abc/def`, `abc//def`, `abc/./def` refers entirely different objects. So users may confuse why I can't get the object with this path.

So OpenDAL needs to implement path normalization to enhance the user experience.

# Guide-level explanation

We will do path normalization automatically.

The following rules will be applied (so far):

- Remove `//` inside path: `op.object("abc/def")` and `op.object("abc//def")` will resolve to the same object.
- Make sure path under `root`: `op.object("/abc")` and `op.object("abc")` will resolve to the same object.

Other rules still need more consideration to leave them for the future.

# Reference-level explanation

We will build the absolute path via `{root}/{path}` and replace all `//` into `/` instead.

# Drawbacks

None

# Rationale and alternatives

## How about the link?

If we build an actual path via `{root}/{path}`, the link object may be inaccessible.

I don't have good ideas so far. Maybe we can add a new flag to control the link behavior. For now, there's no feature request for link support.

Let's leave for the future to resolve.

## S3 URI Clean

For s3, `abc//def` is different from `abc/def` indeed. To make it possible to access not normalized path, we can provide a new flag for the builder:

```rust
let builder = Backend::build().disable_path_normalization()
```

In this way, the user can control the path more precisely.

# Prior art

None

# Unresolved questions

None

# Future possibilities

None
61 changes: 44 additions & 17 deletions src/services/fs.rs
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
// See the License for the specific language governing permissions and
// limitations under the License.

use std::collections::HashMap;
use std::fs;
use std::io::SeekFrom;
use std::path::PathBuf;
Expand Down Expand Up @@ -59,7 +60,19 @@ impl Builder {

pub async fn finish(&mut self) -> Result<Arc<dyn Accessor>> {
// Make `/` as the default of root.
let root = self.root.clone().unwrap_or_else(|| "/".to_string());
let root = match &self.root {
None => "/".to_string(),
Some(v) => {
if !v.starts_with('/') {
return Err(Error::Backend {
kind: Kind::BackendConfigurationInvalid,
context: HashMap::from([("root".to_string(), v.clone())]),
source: anyhow!("Root must start with /"),
});
}
v.to_string()
}
};

// If root dir is not exist, we must create it.
let metadata_root = root.clone();
Expand Down Expand Up @@ -93,24 +106,39 @@ impl Backend {
pub fn build() -> Builder {
Builder::default()
}

pub(crate) fn get_abs_path(&self, path: &str) -> String {
// Joining an absolute path replaces the existing path, we need to
// normalize it before.
let path = path
.split('/')
.filter(|v| !v.is_empty())
.collect::<Vec<&str>>()
.join("/");

PathBuf::from(&self.root)
.join(path)
.to_string_lossy()
.to_string()
}
}

#[async_trait]
impl Accessor for Backend {
async fn read(&self, args: &OpRead) -> Result<BoxedAsyncReader> {
let path = PathBuf::from(&self.root).join(&args.path);
let path = self.get_abs_path(&args.path);

let open_path = path.clone();
let f = unblock(|| fs::OpenOptions::new().read(true).open(open_path))
.await
.map_err(|e| parse_io_error(e, "read", &path.to_string_lossy()))?;
.map_err(|e| parse_io_error(e, "read", &path))?;

let mut f = Unblock::new(f);

if let Some(offset) = args.offset {
f.seek(SeekFrom::Start(offset))
.await
.map_err(|e| parse_io_error(e, "read", &path.to_string_lossy()))?;
.map_err(|e| parse_io_error(e, "read", &path))?;
};

let r: BoxedAsyncReader = match args.size {
Expand All @@ -122,17 +150,17 @@ impl Accessor for Backend {
}

async fn write(&self, mut r: BoxedAsyncReader, args: &OpWrite) -> Result<usize> {
let path = PathBuf::from(&self.root).join(&args.path);
let path = self.get_abs_path(&args.path);

// Create dir before write path.
//
// TODO(xuanwo): There are many works to do here:
// - Is it safe to create dir concurrently?
// - Do we need to extract this logic as new util functions?
// - Is it better to check the parent dir exists before call mkdir?
let parent = path
let parent = PathBuf::from(&path)
.parent()
.ok_or_else(|| anyhow!("malformed path: {:?}", path.to_str()))?
.ok_or_else(|| anyhow!("malformed path: {:?}", &path))?
.to_path_buf();

let capture_parent = parent.clone();
Expand All @@ -148,33 +176,33 @@ impl Accessor for Backend {
.open(capture_path)
})
.await
.map_err(|e| parse_io_error(e, "write", &path.to_string_lossy()))?;
.map_err(|e| parse_io_error(e, "write", &path))?;

let mut f = Unblock::new(f);

// TODO: we should respect the input size.
let s = io::copy(&mut r, &mut f)
.await
.map_err(|e| parse_io_error(e, "write", &path.to_string_lossy()))?;
.map_err(|e| parse_io_error(e, "write", &path))?;

// `std::fs::File`'s errors detected on closing are ignored by
// the implementation of Drop.
// So we need to call `flush` to make sure all data have been flushed
// to fs successfully.
f.flush()
.await
.map_err(|e| parse_io_error(e, "write", &path.to_string_lossy()))?;
.map_err(|e| parse_io_error(e, "write", &path))?;

Ok(s as usize)
}

async fn stat(&self, args: &OpStat) -> Result<Metadata> {
let path = PathBuf::from(&self.root).join(&args.path);
let path = self.get_abs_path(&args.path);

let capture_path = path.clone();
let meta = unblock(|| fs::metadata(capture_path))
.await
.map_err(|e| parse_io_error(e, "stat", &path.to_string_lossy()))?;
.map_err(|e| parse_io_error(e, "stat", &path))?;

let mut m = Metadata::default();
m.set_path(&args.path);
Expand All @@ -191,7 +219,7 @@ impl Accessor for Backend {
}

async fn delete(&self, args: &OpDelete) -> Result<()> {
let path = PathBuf::from(&self.root).join(&args.path);
let path = self.get_abs_path(&args.path);

let capture_path = path.clone();
// PathBuf.is_dir() is not free, call metadata directly instead.
Expand All @@ -214,15 +242,14 @@ impl Accessor for Backend {
unblock(|| fs::remove_file(capture_path)).await
};

f.map_err(|e| parse_io_error(e, "delete", &path.to_string_lossy()))
f.map_err(|e| parse_io_error(e, "delete", &path))
}

async fn list(&self, args: &OpList) -> Result<BoxedObjectStream> {
let path = PathBuf::from(&self.root).join(&args.path);
let path = self.get_abs_path(&args.path);

let open_path = path.clone();
let f = fs::read_dir(open_path)
.map_err(|e| parse_io_error(e, "read", &path.to_string_lossy()))?;
let f = fs::read_dir(open_path).map_err(|e| parse_io_error(e, "read", &path))?;

let rd = Readdir {
acc: Arc::new(self.clone()),
Expand Down
100 changes: 73 additions & 27 deletions src/services/s3/backend.rs
Original file line number Diff line number Diff line change
Expand Up @@ -121,13 +121,14 @@ impl Builder {
// Use "/" as root if user not specified.
None => "/".to_string(),
Some(v) => {
debug_assert!(!v.is_empty(), "root must not be empty");

// If root not startswith or endswith "/", we should append it.
let prepend = if v.starts_with('/') { "" } else { "/" };
let append = if v.ends_with('/') { "" } else { "/" };

format!("{prepend}{v}{append}")
let mut v = Backend::normalize_path(v);
if !v.starts_with('/') {
v.insert(0, '/');
}
if !v.ends_with('/') {
v.push('/')
}
v
}
};

Expand Down Expand Up @@ -315,25 +316,44 @@ impl Backend {
self.client.clone()
}

/// get_abs_path will return the absolute path of the given path in the s3 format.
/// If user input an absolute path, we will return it as it is with the prefix `/` striped.
/// If user input a relative path, we will calculate the absolute path with the root.
pub(crate) fn get_abs_path(&self, path: &str) -> String {
if path.starts_with('/') {
return path.strip_prefix('/').unwrap().to_string();
// normalize_path removes all internal `//` inside path.
pub(crate) fn normalize_path(path: &str) -> String {
let has_trailing = path.ends_with('/');

let mut p = path
.split('/')
.filter(|v| !v.is_empty())
.collect::<Vec<&str>>()
.join("/");

if has_trailing && !p.eq("/") {
p.push('/')
}

let abs_path = format!("{}{}", self.root, path);
return abs_path.strip_prefix('/').unwrap().to_string();
p
}

/// get_abs_path will return the absolute path of the given path in the s3 format.
///
/// Read [RFC-112](https://github.com/datafuselabs/opendal/pull/112) for more details.
pub(crate) fn get_abs_path(&self, path: &str) -> String {
let path = Backend::normalize_path(path);
// root must be normalized like `/abc/`
format!("{}{}", self.root, path)
.trim_start_matches('/')
.to_string()
}

/// get_rel_path will return the relative path of the given path in the s3 format.
pub(crate) fn get_rel_path(&self, path: &str) -> String {
let path = format!("/{}", path);

match path.strip_prefix(&self.root) {
None => unreachable!("invalid input that not start with backend root"),
Some(v) => v.to_string(),
None => unreachable!(
"invalid path {} that not start with backend root {}",
&path, &self.root
),
}
}
}
Expand All @@ -356,7 +376,7 @@ impl Accessor for Backend {
let resp = req
.send()
.await
.map_err(|e| parse_get_object_error(e, "read", &args.path))?;
.map_err(|e| parse_get_object_error(e, "read", &p))?;

Ok(Box::new(S3ByteStream(resp.body).into_async_read()))
}
Expand All @@ -375,7 +395,7 @@ impl Accessor for Backend {
)))
.send()
.await
.map_err(|e| parse_unexpect_error(e, "write", &args.path))?;
.map_err(|e| parse_unexpect_error(e, "write", &p))?;

Ok(args.size as usize)
}
Expand All @@ -390,15 +410,37 @@ impl Accessor for Backend {
.key(&p)
.send()
.await
.map_err(|e| parse_head_object_error(e, "stat", &args.path))?;
.map_err(|e| parse_head_object_error(e, "stat", &p));

match meta {
Ok(meta) => {
let mut m = Metadata::default();
m.set_path(&args.path);
m.set_content_length(meta.content_length as u64);

if p.ends_with('/') {
m.set_mode(ObjectMode::DIR);
} else {
m.set_mode(ObjectMode::FILE);
};

let mut m = Metadata::default();
m.set_path(&args.path)
.set_content_length(meta.content_length as u64)
.set_mode(ObjectMode::FILE)
.set_complete();
m.set_complete();

Ok(m)
Ok(m)
}
// Always returns empty dir object if path is endswith "/" and we got an
// ObjectNotExist error.
Err(e) if (e.kind() == Kind::ObjectNotExist && p.ends_with('/')) => {
let mut m = Metadata::default();
m.set_path(&args.path);
m.set_content_length(0);
m.set_mode(ObjectMode::DIR);
m.set_complete();

Ok(m)
}
Err(e) => Err(e),
}
}

async fn delete(&self, args: &OpDelete) -> Result<()> {
Expand All @@ -411,13 +453,17 @@ impl Accessor for Backend {
.key(&p)
.send()
.await
.map_err(|e| parse_unexpect_error(e, "delete", &args.path))?;
.map_err(|e| parse_unexpect_error(e, "delete", &p))?;

Ok(())
}

async fn list(&self, args: &OpList) -> Result<BoxedObjectStream> {
let path = self.get_abs_path(&args.path);
let mut path = self.get_abs_path(&args.path);
// Make sure list path is endswith '/'
if !path.ends_with('/') && !path.is_empty() {
path.push('/')
}

Ok(Box::new(S3ObjectStream::new(
self.clone(),
Expand Down
Loading