Add optional implicit directory behavior #7

jacobsa · 2015-03-19T00:22:50Z

As discussed in the semantics doc, we require objects to exist for directories as well as files; there is no such thing as an implicit directory. If an object named foo/bar exists but no object named foo/ exists, then the file system behaves as if foo/bar does not exist. So if the user mounts a bucket containing only an object named foo/bar and then does cat foo/bar, they will get a "file not found" error.

Issue

When the user does cat foo/bar, fuse sends the following requests to gcsfuse:

1. Look up the name "foo" within the root inode. Return its inode ID and whether it's a file or a directory, or fail if it's non-existent. Call the returned inode F.
2. Look up the name "bar" within F. Return its inode ID and whether it's a file or a directory. Call the inode B.
3. Read the contents of B.

The fundamental issue is that at the point of the call in (1), gcsfuse can see that an object named foo doesn't exist and therefore can say "foo" doesn't refer to a file, but needs to decide between telling the kernel that it doesn't exist at all or telling the kernel that it refers to a directory.

Current behavior

The current behavior is that in (1) gcsfuse asks GCS to do a consistent read of the metadata for two objects, foo and foo/. If it finds the first it calls "foo" a file, if it finds the second it calls it a directory, and if it finds neither it says it doesn't exist. That's why we require the object foo/ to exist for the directory to appear to exist.

This method works because unlike listing objects by prefix, a read of the metadata for a single object is guaranteed to be fresh.
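The stat-based lookup described above can be sketched as follows. This is only an illustration, not gcsfuse's code: the bucket is modeled as an in-memory set of object names, membership tests stand in for consistent single-object metadata reads, and the names `Kind` and `look_up` are hypothetical.

```python
from enum import Enum
from typing import Set


class Kind(Enum):
    FILE = "file"
    DIRECTORY = "directory"
    MISSING = "missing"


def look_up(objects: Set[str], name: str) -> Kind:
    """Classify `name` with two point lookups: `name` and `name/`.

    `objects` is a stand-in for the bucket; `in` plays the role of a
    consistent read of a single object's metadata.
    """
    if name in objects:
        return Kind.FILE
    if name + "/" in objects:
        return Kind.DIRECTORY
    return Kind.MISSING


# A bucket holding only "foo/bar" has no placeholder object "foo/",
# so the lookup of "foo" reports that nothing exists.
print(look_up({"foo/bar"}, "foo"))
print(look_up({"foo/", "foo/bar"}, "foo"))
```

Under this model the object foo/bar is stranded until a foo/ placeholder is created, which is exactly the behavior this issue is about.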

Alternatives

Listing-based lookup

One alternative is that we implement (1) by asking whether foo exists (as today) and by scanning objects with the prefix foo/, saying that "foo" is a directory if the scan is non-empty. But there are drawbacks here:

- With this setup, it will appear as if there is a directory called "foo" containing a file called "bar". But when the user does rm foo/bar, suddenly it will appear as if the file system is completely empty. This is contrary to expectations, since the user hasn't done rmdir foo.
- Similarly, if the user does rm foo/bar then touch foo/baz, the second command will fail with a surprising "no such file or directory" error.
- The operation of scanning the prefix foo/ maps down to an unbounded number of requests to GCS, since each response contains a continuation token that must be redeemed to continue scanning, and GCS does only a limited amount of work before bailing out and returning that token. This means a single simple path resolution may result in enormous expense.
  - It's possible that this is a non-issue if GCS guarantees that a) the first response for a non-empty range is non-empty, and that b) only one request is required for an empty range. This is not documented anywhere, so we would need to check with the GCS team about it. In particular, this would be impossible to guarantee if GCS uses tombstone records in the data source that is scanned when listing objects.
  - For example, assume that a large directory used to exist but many or all of the objects within it have been deleted. Upon each Objects.list request, GCS will scan the hidden tombstones for those dead objects for a while and then return an empty response with a continuation token. We may need to scan the entire range of tombstones (resulting in many requests) before we are sure the directory still exists. If all of its contents are gone, then we definitely must scan the entire range before we can say so.
  - Update: Indeed, nherring tells me over IM that GCS offers neither guarantee: "Under conditions where many objects have been recently deleted, yes, you can can receive no useful answer and a page token".
- Because listings may be arbitrarily far out of date, this will be a flaky user experience unless the bucket is never modified. Some behaviors that this might cause:
  - The user creates an object by some other means (storage explorer, gsutil, etc.) and the implicitly-defined directory doesn't show up as existing for minutes or even hours. This is not unlikely based on the listing freshness numbers I've seen.
  - The user does rm foo/bar. As discussed above, the directory "foo" now no longer exists because it was only implicitly defined, so the user gets the surprising behavior of touch foo/baz failing. Except they only get this behavior once the listing catches up. Worse, if they try the experiment several times then it may fail, succeed, fail, succeed, and fail again.
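To make the cost concern about continuation tokens concrete, here is a rough sketch of a listing-based existence check against a fake paged lister. Nothing here is the real GCS API: `make_fake_lister` and `directory_exists_by_listing` are hypothetical names, and empty pages that still carry a token model the tombstone case described above.

```python
from typing import Callable, List, Optional, Tuple

# One listing response: (object names, continuation token or None).
Page = Tuple[List[str], Optional[str]]


def make_fake_lister(pages: List[List[str]]) -> Callable[[Optional[str]], Page]:
    """Build a fake lister that serves `pages` one request at a time."""
    def list_page(token: Optional[str]) -> Page:
        index = 0 if token is None else int(token)
        next_token = str(index + 1) if index + 1 < len(pages) else None
        return pages[index], next_token
    return list_page


def directory_exists_by_listing(
    list_page: Callable[[Optional[str]], Page],
) -> Tuple[bool, int]:
    """Return (exists, number of requests issued).

    An empty page is only conclusive when it comes without a token, so
    the loop must keep redeeming tokens until a name or the end appears.
    """
    token: Optional[str] = None
    requests = 0
    while True:
        names, token = list_page(token)
        requests += 1
        if names:
            return True, requests
        if token is None:
            return False, requests


# Three tombstone-only pages precede the one live object.
print(directory_exists_by_listing(make_fake_lister([[], [], [], ["foo/bar"]])))
# Every page is empty: the whole range must be scanned to answer "no".
print(directory_exists_by_listing(make_fake_lister([[], [], []])))
```

The request count grows with the number of empty pages, which is why a single name lookup has no upper bound on cost under this scheme.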

Even if GCS eventually offers list-your-own-writes consistency, negating the last point, the other issues remain.

Fixup tool

If users want to mount buckets where they've created object names assuming that implicit directories will work, we can create a "fixup" tool that lists the bucket and creates the appropriate objects for the implicit directories.

The main caveat here is that the tool would itself depend upon listing, so it may miss some objects in the bucket for the same reasons discussed above. Another caveat is the need to run such a tool, but the behavior could be built into the gcsfuse binary itself (either as the default when mounting or on an opt-in basis).
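As a sketch of what the core of such a fixup tool might compute, the function below takes a listing of object names and returns the placeholder directory objects that are missing. It assumes the listing is already in hand (and therefore inherits the staleness caveat above); the function name is hypothetical and no GCS calls are made.

```python
from typing import Iterable, List


def missing_directory_placeholders(object_names: Iterable[str]) -> List[str]:
    """Return the `dir/` placeholder names implied by the object names
    but not present among them, in sorted order."""
    names = set(object_names)
    needed = set()
    for name in names:
        # Every component except the last is a directory on the path.
        parts = name.split("/")[:-1]
        for depth in range(1, len(parts) + 1):
            needed.add("/".join(parts[:depth]) + "/")
    return sorted(needed - names)


listing = ["foo/bar", "a/b/c", "a/", "top"]
print(missing_directory_placeholders(listing))
```

A real tool would then create an empty object for each returned name.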

jacobsa · 2015-03-19T21:19:35Z

I currently lean toward treating this in the same way as the fast vs. consistent tradeoffs elsewhere in the product: make it configurable at mount time. (I know this is a cliche, but I don't have a better answer.)

Users with existing buckets with objects containing slash delimiters but no placeholder directory objects can have those objects show up, at the cost of flaky name lookups and weird directory existence behavior. Careful users who want to write software against the file system (including me when writing tests) can have sane and non-flaky behavior at the cost of potentially stranded objects.

More concretely: when serving a lookup, follow the current behavior. If the answer is "not found", optionally fall back to the listing path discussed above.

jacobsa · 2015-03-25T23:17:36Z

Decision with Jay: make it configurable, defaulting to the safe/non-flaky behavior, but with prominent advertising for the option.

jacobsa · 2015-03-26T20:52:54Z

Implementation notes:

- Config struct when creating the file system. Has a bool for this option, default off.
- Plumb the config struct into the test setup process.
- Separate test for this feature, which sets the bool.
  - If the test is flaky, we can have it not run for integration tests, using a branch on some "listings are consistent" bool trait supplied by the bucket wiring code.
- Document the flag prominently in the readme and semantics docs, with more details in the latter. Call out to this thread for full details, but inline a summary of the downsides of turning the flag on and having it off.

jacobsa · 2015-03-27T01:58:11Z

Now that we've decided that conflicting file/directory names shouldn't cause one of the two to be hidden (cf. #28), I think we can't use the "fall back to listing only if the usual method fails" algorithm in LookUpInode. If we do, implicit directories will be hidden by files with conflicting names.

I think the logic should be:

1. In parallel:
   - Check for name existing as file.
     - This is implemented by statting.
   - Check for name existing as directory.
     - This is implemented by statting, and optionally in parallel listing and looking for a non-empty result.
2. If the object exists as a directory, return the directory inode.
3. Otherwise if it exists as a file, return the file inode.
4. Otherwise return ENOENT.

Later, when #28 is worked on, the last part of the logic can be updated to deal with the \n sentinel thing discussed in that issue.
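A minimal sketch of the lookup logic above, under the same in-memory model: a set of object names stands in for the bucket, membership tests stand in for stats, and a prefix match stands in for a listing. The function name, the `implicit_dirs` parameter, and the string return values are illustrative assumptions, not the actual gcsfuse code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Set


def look_up_inode(objects: Set[str], name: str, implicit_dirs: bool) -> str:
    """Return "directory", "file", or "ENOENT" for `name`."""
    def is_file() -> bool:
        return name in objects  # stand-in for a stat of `name`

    def is_explicit_dir() -> bool:
        return name + "/" in objects  # stand-in for a stat of `name/`

    def is_implicit_dir() -> bool:
        # Stand-in for listing the prefix `name/` and checking that the
        # result is non-empty.
        prefix = name + "/"
        return any(o.startswith(prefix) for o in objects)

    # Issue all checks in parallel, then combine the answers.
    with ThreadPoolExecutor(max_workers=3) as pool:
        file_future = pool.submit(is_file)
        dir_future = pool.submit(is_explicit_dir)
        listing_future = pool.submit(is_implicit_dir) if implicit_dirs else None
        exists_as_file = file_future.result()
        exists_as_dir = dir_future.result() or (
            listing_future is not None and listing_future.result()
        )

    # A directory wins over a file with a conflicting name.
    if exists_as_dir:
        return "directory"
    if exists_as_file:
        return "file"
    return "ENOENT"


print(look_up_inode({"foo", "foo/bar"}, "foo", implicit_dirs=True))
print(look_up_inode({"foo", "foo/bar"}, "foo", implicit_dirs=False))
```

With the option on, the implicit directory "foo" is reported even though a file named "foo" also exists, which is the case the fallback-only algorithm would get wrong.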


friday · 2016-12-01T16:31:24Z

I wasn't happy with the performance of --implicit dirs, so I wrote a bash oneliner to create missing dirs. You need to set up gsutil with your account first, and substitute the variables for your bucket name and mount point:

BUCKET=my-bucket MOUNTED_AT=/path/to/mount; gsutil ls "gs://$BUCKET/*/*/**" | xargs dirname | sort | uniq | sed "s/gs:\/\/$BUCKET\///" | xargs -I % mkdir -p "$MOUNTED_AT/%"

It doesn't print any progress info, but it worked like a charm for me. Some unix flavors might need the arguments adjusted though.

Update:
Changed "gs://$BUCKET/**" to "gs://$BUCKET/*/*/**" to fix a harmless issue detailed below.

MongoExpUser · 2018-03-19T11:28:32Z

Just:

Create bucket
gsutil mb gs://myBucket/

Mount bucket
sudo gcsfuse myBucket /path-to-mount/folder

See google link for the create bucket part and other commands using gsutil: https://cloud.google.com/storage/docs/how-to

FossPrime · 2018-05-11T00:01:59Z

Seems to work on macOS with bash 4. The function wrapper is optional.

function fix-material() {
  BUCKET=mybucket
  MOUNTED_AT=/Users/mount
  gsutil ls "gs://${BUCKET}/**" | xargs -E '\n' -n1 dirname | sort | uniq | sed "s/gs:\/\/${BUCKET}\///" | xargs -I % mkdir -p "${MOUNTED_AT}/%"
}

Edit:
I got an extraneous gs: folder made at the root of the bucket for some reason. Be sure to run find . -type d | wc -l with and without implicit dirs to verify. When I did it the numbers were equal, after deleting the random gs directory, which may have been placed there by my debugging attempts.

friday · 2018-05-15T22:14:56Z

@rayfoss I updated my comment, changing the globbing pattern to skip the root level. dirname removes the last part, i.e. "fileorfolder" in "gs://my-bucket/subdirectory/fileorfolder", so if you had files or folders in the root level (like if you run this twice) you would get a "gs:" folder. Pretty harmless issue though.
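The effect can be reproduced without touching a bucket. The sketch below mimics the dirname and sed steps of the one-liner in Python; `relative_dir` is a hypothetical helper, with `posixpath.dirname` standing in for dirname and a single prefix replacement standing in for the sed expression.

```python
import posixpath


def relative_dir(url: str, bucket: str) -> str:
    """Mimic `dirname` followed by sed "s/gs:\\/\\/$BUCKET\\///"."""
    parent = posixpath.dirname(url)  # drops the last path component
    prefix = f"gs://{bucket}/"
    # Like the sed expression, this only strips the prefix when the
    # trailing slash is still present after dirname.
    return parent.replace(prefix, "", 1)


# Nested object: the prefix is stripped and a normal directory remains.
print(relative_dir("gs://my-bucket/subdirectory/fileorfolder", "my-bucket"))
# Root-level object: dirname leaves "gs://my-bucket" with no trailing
# slash, so nothing is stripped and mkdir -p would create a "gs:" folder.
print(relative_dir("gs://my-bucket/fileorfolder", "my-bucket"))
```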

pkdetlefsen · 2018-05-30T11:27:37Z

@rayfoss @friday I had folders at root level that needed to be created as well, so I used the "gs://$BUCKET/**" pattern and instead used tail -n +2 to remove the root folder entry from the directory list.

BUCKET=my-bucket MOUNTED_AT=/path/to/mount; gsutil ls "gs://$BUCKET/**" | xargs dirname | sort | uniq | tail -n +2 | sed "s/gs:\/\/$BUCKET\///" | xargs -I % mkdir -p "$MOUNTED_AT/%"

Thanks a lot for the oneliner!

jacobsa self-assigned this Mar 25, 2015

jacobsa added this to the alpha milestone Mar 25, 2015

jacobsa changed the title ~~Decide what to do about placeholder directory objects~~ Add optional implicit directory behavior Mar 25, 2015

jacobsa added a commit that referenced this issue Mar 26, 2015

Refactored NewServer to accept a struct.

a249338

In preparation for adding more configuration options. See #7 for more.

jacobsa added a commit that referenced this issue Mar 27, 2015

Added an implicit directory feature, with tests.

d28ef5d

The knob is not currently exposed to the user. For #7.

jacobsa added a commit that referenced this issue Mar 27, 2015

Added and documented a --implicit_dirs flag.

8e66450

For #7.

jacobsa closed this as completed Mar 27, 2015

This was referenced Oct 27, 2022

Directories missing in GCS file_mounts skypilot-org/skypilot#1154

Closed

[Storage] add --implicit-dirs for gcsfuse skypilot-org/skypilot#1312

Merged
