Add optional implicit directory behavior #7

Closed
jacobsa opened this issue Mar 19, 2015 · 9 comments


jacobsa commented Mar 19, 2015

As discussed in the semantics doc, we require objects to exist for directories as well as files; there is no such thing as an implicit directory. If an object named foo/bar exists but no object named foo/ exists, then the file system behaves as if foo/bar does not exist. So if the user mounts a bucket containing only an object named foo/bar and then does cat foo/bar, they will get a "file not found" error.

Issue

When the user does cat foo/bar, fuse sends the following requests to gcsfuse:

  1. Look up the name "foo" within the root inode. Return its inode ID and whether it's a file or a directory, or fail if it's non-existent. Call the returned inode F.
  2. Look up the name "bar" within F. Return its inode ID and whether it's a file or a directory. Call the inode B.
  3. Read the contents of B.

The fundamental issue is that at the point of the call in (1), gcsfuse can see that an object named foo doesn't exist and therefore can say "foo" doesn't refer to a file, but needs to decide between telling the kernel that it doesn't exist at all or telling the kernel that it refers to a directory.

Current behavior

The current behavior is that in (1) gcsfuse asks GCS to do a consistent read of the metadata for two objects, foo and foo/. If it finds the first it calls "foo" a file, if it finds the second it calls it a directory, and if it finds neither it says it doesn't exist. That's why we require the object foo/ to exist for the directory to appear to exist.

This method works because unlike listing objects by prefix, a read of the metadata for a single object is guaranteed to be fresh.
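As a rough illustration of the current behavior, here is a toy model in which the bucket is just a newline-separated list of object names. The `OBJECTS` variable and the `object_exists` and `lookup_strict` functions are hypothetical names invented for this sketch, not part of gcsfuse; `object_exists` stands in for a consistent metadata read of a single object.

```shell
# Toy model: the bucket is a newline-separated list of object names.
OBJECTS='foo/bar'

# object_exists NAME: succeeds if an object with exactly this name exists.
# Stands in for a consistent metadata read of one object.
object_exists() {
  printf '%s\n' "$OBJECTS" | grep -Fxq -- "$1"
}

# lookup_strict NAME: classify NAME using only the objects NAME and NAME/.
lookup_strict() {
  if object_exists "$1"; then
    echo file
  elif object_exists "$1/"; then
    echo dir
  else
    echo ENOENT
  fi
}

lookup_strict foo                     # ENOENT: there is no object named foo/
OBJECTS=$(printf 'foo/\nfoo/bar')     # add the placeholder object foo/
lookup_strict foo                     # dir
```

With only `foo/bar` in the bucket, step (1) fails, so the kernel never gets as far as looking up "bar".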

Alternatives

Listing-based lookup

One alternative is that we implement (1) by asking whether foo exists (as today) and by scanning objects with the prefix foo/, saying that "foo" is a directory if the scan is non-empty. But there are drawbacks here:

  • With this setup, it will appear as if there is a directory called "foo" containing a file called "bar". But when the user does rm foo/bar, suddenly it will appear as if the file system is completely empty. This is contrary to expectations, since the user hasn't done rmdir foo.
  • Similarly, if the user does rm foo/bar then touch foo/baz, the second command will fail with a surprising "no such file or directory" error.
  • The operation of scanning the prefix foo/ maps down to an unbounded number of requests to GCS, since each response contains a continuation token that must be redeemed to continue scanning and GCS does only a limited amount of work before bailing out and returning this. This means a single simple path resolution may result in enormous expense.
    • It's possible that this is a non-issue if GCS guarantees that a) the first response for a non-empty range is non-empty, and that b) only one request is required for an empty range. This is not documented anywhere, so we would need to check with the GCS team about it. In particular, this would be impossible to guarantee if GCS uses tombstone records in the data source that is scanned when listing objects.
    • For example, assume that a large directory used to exist but many or all of the objects within it have been deleted. Upon each Objects.list request, GCS will scan the hidden tombstones for those dead objects for a while and then return an empty response with a continuation token. We may need to scan the entire range of tombstones (resulting in many requests) before we are sure the directory still exists. If all of its contents are gone, then we definitely must scan the entire range before we can say so.
    • Update: Indeed, nherring tells me over IM that GCS offers neither guarantee: "Under conditions where many objects have been recently deleted, yes, you can can receive no useful answer and a page token".
  • Because listings may be arbitrarily far out of date, this will be a flaky user experience unless the bucket is never modified. Some behaviors that this might cause:
    • The user creates an object by some other means—storage explorer, gsutil, etc.—and the implicitly-defined directory doesn't show up as existing for minutes or even hours. This is not unlikely based on the listing freshness numbers I've seen.
    • The user does rm foo/bar. As discussed above, the directory "foo" no longer exists because it was only implicitly defined, so the user gets the surprising behavior of touch foo/baz failing. Except they only get this behavior once the listing catches up. Worse, if they try the experiment several times, it may fail, succeed, fail, succeed, and fail again.

Even if GCS eventually offers list-your-own-writes consistency, negating the last point, the other issues remain.
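The first two drawbacks in the list above can be reproduced in a toy model where the bucket is a newline-separated list of object names. This is only a sketch: `OBJECTS`, `object_exists`, `prefix_nonempty`, and `lookup_implicit` are hypothetical names, and the prefix scan here is a single in-memory pass, so it shows neither the paging cost nor the listing staleness described above.

```shell
# Toy model: the bucket is a newline-separated list of object names.
OBJECTS='foo/bar'

# object_exists NAME: succeeds if an object with exactly this name exists.
object_exists() {
  printf '%s\n' "$OBJECTS" | grep -Fxq -- "$1"
}

# prefix_nonempty PREFIX: succeeds if any object name starts with PREFIX.
prefix_nonempty() {
  printf '%s\n' "$OBJECTS" |
    awk -v p="$1" 'index($0, p) == 1 { found = 1 } END { exit !found }'
}

# lookup_implicit NAME: NAME is a directory if anything lives under NAME/.
lookup_implicit() {
  if object_exists "$1"; then
    echo file
  elif prefix_nonempty "$1/"; then
    echo dir
  else
    echo ENOENT
  fi
}

lookup_implicit foo   # dir: implied by the object foo/bar
OBJECTS=''            # models `rm foo/bar`
lookup_implicit foo   # ENOENT: the directory vanished without any rmdir
```

After the simulated `rm`, a later `touch foo/baz` would have to look up "foo" first, which is why it would fail with "no such file or directory".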

Fixup tool

If users want to mount buckets where they've created object names assuming that implicit directories will work, we can create a "fixup" tool that lists the buckets and creates the appropriate objects for the implicit directories.

The main caveat here is that the tool would itself depend upon listing, so it may miss some objects in the bucket for the same reasons discussed above. Another caveat is the need to run such a tool, but the behavior could be built into the gcsfuse binary itself (either as the default when mounting or on an opt-in basis).
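To make the idea concrete, here is a minimal sketch of the name computation such a tool would need: given object names on stdin (one per line, without any bucket prefix), print every implied directory name that has no placeholder object of its own. The function name `missing_placeholders` is hypothetical, and the sketch neither talks to GCS nor creates anything, so the listing caveat above still applies to whatever produces its input.

```shell
# missing_placeholders: read object names on stdin and print each implied
# directory name (with trailing slash) that is not itself an object.
missing_placeholders() {
  awk '
    { names[$0] = 1; lines[NR] = $0 }
    END {
      for (i = 1; i <= NR; i++) {
        n = split(lines[i], parts, "/")
        prefix = ""
        # Every proper prefix ending in "/" is an implied directory.
        for (j = 1; j < n; j++) {
          prefix = prefix parts[j] "/"
          if (!(prefix in names) && !(prefix in seen)) {
            seen[prefix] = 1
            print prefix
          }
        }
      }
    }'
}

# Prints "a/" and then "a/b/"; top.txt implies no directory.
printf 'a/b/c.txt\na/d.txt\ntop.txt\n' | missing_placeholders
```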


jacobsa commented Mar 19, 2015

I currently lean toward treating this in the same way as the fast vs. consistent tradeoffs elsewhere in the product: make it configurable at mount time. (I know this is a cliche, but I don't have a better answer.)

Users with existing buckets with objects containing slash delimiters but no placeholder directory objects can have those objects show up, at the cost of flaky name lookups and weird directory existence behavior. Careful users who want to write software against the file system (including me when writing tests) can have sane and non-flaky behavior at the cost of potentially stranded objects.

More concretely: when serving a lookup, follow the current behavior. If the answer is "not found", optionally fall back to the listing path discussed above.

@jacobsa jacobsa self-assigned this Mar 25, 2015
@jacobsa jacobsa added this to the alpha milestone Mar 25, 2015

jacobsa commented Mar 25, 2015

Decision with Jay: make it configurable, defaulting to the safe/non-flaky behavior, but with prominent advertising for the option.

@jacobsa jacobsa changed the title Decide what to do about placeholder directory objects Add optional implicit directory behavior Mar 25, 2015

jacobsa commented Mar 26, 2015

Implementation notes:

  • Config struct when creating file system. Has a bool for this option, default off.
  • Plumb the config struct into the test setup process.
  • Separate test for this feature, sets the bool.
  • If the test is flaky, we can have it not run for integration tests, using a branch on some "listings are consistent" bool trait supplied by the bucket wiring code.
  • Document the flag prominently in the readme and semantics docs, with more details in the latter. Call out to this thread for full details, but inline a summary of the downsides of turning the flag on and having it off.

jacobsa added a commit that referenced this issue Mar 26, 2015
In preparation for adding more configuration options. See #7 for more.

jacobsa commented Mar 27, 2015

Now that we've decided that conflicting file/directory names shouldn't cause one of the two to be hidden (cf. #28), I think we can't use the "fall back to listing only if the usual method fails" algorithm in LookUpInode. If we do, implicit directories will be hidden by files with conflicting names.

I think the logic should be:

  • In parallel:
    • Check for name existing as file.
      • This is implemented by statting.
    • Check for name existing as directory.
      • This is implemented by statting, and optionally in parallel listing and looking for a non-empty result.
  • If the object exists as a directory, return the directory inode.
  • Otherwise if it exists as a file, return the file inode.
  • Otherwise return ENOENT.

Later, when #28 is worked on, the last part of the logic can be updated to deal with the \n sentinel thing discussed in that issue.
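The precedence rule in that logic can be sketched in a toy model where the bucket is a newline-separated list of object names. Everything here is an assumption made for illustration: `OBJECTS`, `IMPLICIT_DIRS`, and the function names are hypothetical, and the checks run one after another rather than in parallel. The point is only that a file named `foo` no longer hides an implicit directory `foo`.

```shell
# Toy model: a file object foo and an object under the implicit directory foo/.
OBJECTS=$(printf 'foo\nfoo/bar')
IMPLICIT_DIRS=1   # hypothetical stand-in for the mount-time option

# object_exists NAME: succeeds if an object with exactly this name exists.
object_exists() {
  printf '%s\n' "$OBJECTS" | grep -Fxq -- "$1"
}

# prefix_nonempty PREFIX: succeeds if any object name starts with PREFIX.
prefix_nonempty() {
  printf '%s\n' "$OBJECTS" |
    awk -v p="$1" 'index($0, p) == 1 { found = 1 } END { exit !found }'
}

# exists_as_dir NAME: a placeholder object NAME/ exists, or (only when the
# option is on) some object lives under NAME/.
exists_as_dir() {
  object_exists "$1/" && return 0
  [ "$IMPLICIT_DIRS" = 1 ] && prefix_nonempty "$1/"
}

# lookup NAME: directory wins over file; otherwise ENOENT.
lookup() {
  if exists_as_dir "$1"; then
    echo dir
  elif object_exists "$1"; then
    echo file
  else
    echo ENOENT
  fi
}

lookup foo   # dir: the implicit directory is not hidden by the file foo
```

With `IMPLICIT_DIRS=0` the same call prints `file`, since neither a placeholder `foo/` exists nor is listing consulted.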

jacobsa added a commit that referenced this issue Mar 27, 2015
The knob is not currently exposed to the user.

For #7.
jacobsa added a commit that referenced this issue Mar 27, 2015
@jacobsa jacobsa closed this as completed Mar 27, 2015

friday commented Dec 1, 2016

I wasn't happy with the performance of --implicit-dirs, so I wrote a bash one-liner to create the missing dirs. You need to set up gsutil with your account first, and substitute the variables for your bucket name and mount point:

BUCKET=my-bucket MOUNTED_AT=/path/to/mount; gsutil ls "gs://$BUCKET/*/*/**" | xargs dirname | sort | uniq | sed "s/gs:\/\/$BUCKET\///" | xargs -I % mkdir -p "$MOUNTED_AT/%"

It doesn't print any progress info, but it worked like a charm for me. Some unix flavors might need the arguments adjusted though.

Update:
Changed "gs://$BUCKET/**" to "gs://$BUCKET/*/*/**" to fix a harmless issue detailed below.


MongoExpUser commented Mar 19, 2018

Just:

Create bucket
gsutil mb gs://myBucket/

Mount bucket
sudo gcsfuse myBucket /path-to-mount/folder

See google link for the create bucket part and other commands using gsutil: https://cloud.google.com/storage/docs/how-to


FossPrime commented May 11, 2018

Seems to work on macOS with bash 4. The function wrapper is optional.

function fix-material() {
  BUCKET=mybucket
  MOUNTED_AT=/Users/mount
  gsutil ls "gs://${BUCKET}/**" | xargs -E '\n' -n1 dirname | sort | uniq | sed "s/gs:\/\/${BUCKET}\///" | xargs -I % mkdir -p "${MOUNTED_AT}/%"
}

Edit:
I got an extraneous gs: folder made at the root of the bucket for some reason. Be sure to run find . -type d | wc -l with and without implicit dirs to verify. When I did it the numbers were equal, after deleting the random gs directory, which may have been placed there by my debugging attempts.


friday commented May 15, 2018

@rayfoss I updated my comment, changing the globbing pattern to skip the root level. dirname removes the last part, i.e. "fileorfolder" in "gs://my-bucket/subdirectory/fileorfolder", so if you had files or folders at the root level (for example, if you run this twice) you would get a "gs:" folder. Pretty harmless issue though.
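A small sketch of why root-level entries produce the stray "gs:" folder, using the same dirname and sed steps as the one-liner. The `strip_bucket` helper is a hypothetical name introduced only for this illustration.

```shell
BUCKET=my-bucket

# strip_bucket: the sed step from the one-liner, removing "gs://$BUCKET/".
strip_bucket() {
  sed "s/gs:\/\/$BUCKET\///"
}

# Nested entry: dirname leaves "gs://my-bucket/subdirectory", and sed
# strips the bucket prefix, giving "subdirectory".
dirname "gs://$BUCKET/subdirectory/fileorfolder" | strip_bucket

# Root-level entry: dirname leaves "gs://my-bucket" with no trailing slash,
# so the sed pattern does not match and the string passes through unchanged.
dirname "gs://$BUCKET/fileorfolder" | strip_bucket
```

Passing that unchanged string to `mkdir -p "$MOUNTED_AT/gs://my-bucket"` is what creates a directory named `gs:` under the mount point.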

@pkdetlefsen

@rayfoss @friday I had folders at root level that needed to be created as well, so I used the "gs://$BUCKET/**" pattern and instead used tail -n +2 to remove the root folder entry from the directory list.

BUCKET=my-bucket MOUNTED_AT=/path/to/mount; gsutil ls "gs://$BUCKET/**" | xargs dirname | sort | uniq | tail -n +2 | sed "s/gs:\/\/$BUCKET\///" | xargs -I % mkdir -p "$MOUNTED_AT/%"

Thanks a lot for the oneliner!
