Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random Access API #889

Closed
swankjesse opened this issue Feb 6, 2021 · 9 comments
Closed

Random Access API #889

swankjesse opened this issue Feb 6, 2021 · 9 comments

Comments

@swankjesse
Copy link
Collaborator

swankjesse commented Feb 6, 2021

Goal

An API for reading files such that accesses aren’t strictly sequential.

Use Cases

  • Reading zip files. I’d like to be able to build something like Java’s ZipFile on Okio’s FileSystem. It isn’t very easy to do today.
  • Reading heap dumps like in Shark.
  • Reading offset-based binary files like dx.

Like Buffer.UnsafeCursor, I expect that random access is a power feature that most users shouldn’t need. Instead, it is a baseline for building higher-level abstractions like ZipFileSystem.

API

interface Source : Closeable {
  /**
   * The position of the next byte that [read] will return, or -1L if this source does not support
   * this feature. Increase the read position to efficiently skip bytes, or decrease it to rewind.
   * 
   * It is an error to set the position on a source that doesn't support it.
   */
  var position: Long
    get() = -1L
    set(value) {
      require(value == -1L) { "this source does not implement position" }
    }
  ...
}

Implementation

  • FileSystem: We can support this on all sources returned by FileSystem.SYTEM and FakeFileSystem.
  • BufferedSource: We can support this on BufferedSource if the source being buffered also supports it. The implementation would clear the buffer on seeking backwards, and removes (all?) bytes from the front of the buffer when seeking forwards.
  • ForwardingSource: We should not support this on the ForwardingSource base class because it’s hazardous for subclasses like HashingSource. (We’re fine for HashingSource ourselves, but unsafe to introduce because it may introduce similar problems for user subclasses.)
  • Buffer: We should not support this on Buffer. In theory we could support reading the position but not writing it; this would be the total number of bytes ever written to the buffer, independent of how many bytes have been consumed. But I think this is a bad idea; two empty buffers could have different read positions and this has weird consequences for equality.

Design Options

RandomAccess as a mixin interface

We could improve the API so that position support is included in the type system. For example:

interface RandomAccessSource : Source {
  /** ... */
  var position: Long
}

We would need to split BufferedSource into BufferedSource and BufferedRandomAccessSource. And ForwardingSource into ForwardingSource and ForwardingRandomAccessSource.

FileSystem.source() cannot return a RandomAccessSource because implementations like ZipFileSystem cannot efficiently implement random access on their contents. In fact, different files within a single file system may have different support; presumably seeking on a UNIX pipe is not possible.

I expect if we did this, users would still do runtime checks; they’d just use if (source is RandomAccessSource) rather than if (source.position != -1L).

Rename position to readPosition

We keep the door open to later support independent readPosition and writePosition properties on Buffer. (We may find a design to overcome the position + equality thing on buffers.)

Modes

Rather than using a sentinel value, -1L to indicate that position is unsupported, we could make the sources self-describe that they support. Many random access modes are possible:

  • No support: as today. the position doesn’t exist.
  • Read-only: the position is effectively a val that tells you how many bytes have been consumed so far. This is implementable on sockets and pipes.
  • Read-write: the position is a var that can seek forwards and backwards.
  • Seek forward only: could be emulated today using skip().

Size

We could add another field, size with the total number of bytes in the source. This could also be exposed in APIs like OkHttp’s response body. Returning a size of -1L indicates that the size is unknown.

@yschimke
Copy link
Collaborator

yschimke commented Feb 7, 2021

Mixin interface is awkward given we favour building decoratora elsewhere. Eg logging timing etc

@yschimke
Copy link
Collaborator

yschimke commented Feb 7, 2021

I've learned to dislike proto or thrift methods using primitives for results. If you had two results to return you'd have a return type. Which could then be changed to 3-4 results without API breaking.

Also from bytebuffer use cases with absolute and relative cases, its really awkward with setting position and then resetting.

Why didn't you consider using an optional nullable cursor approach? Cursor could be closeable and maybe optionally writeable?

Could also be useful if you design for a remote API like s3?

@yschimke
Copy link
Collaborator

yschimke commented Feb 7, 2021

On scope - useful to be clear what is in an out of scope. From description above S3 out of scope. Local memory or failed including compressed would be in scope. Make this clear in the type system. LocalCursor vs a field position on a very general API like Source. Scope becomes simpler if this is a peer of Source, not part of it.

@yschimke
Copy link
Collaborator

yschimke commented Feb 7, 2021

How does it sit with Source.

a) 1:1, it is the instance of the Source. (current suggestion)
b) 1:n, it is a child of an open Source.
c) 1:n is it a peer and alternative to a Source.

I'm assuming a) makes sense because Source could come from a non cloneable InputStream.

@yschimke
Copy link
Collaborator

yschimke commented Feb 7, 2021

On API usage - ByteBuffer is the existing JDK API that feels related. It makes it really awkward with positions, flipping, relative and absolute methods.

In contrast the recent read and write batch APIs from Okio filesystem worked really well within a session. Can we show some live examples from OkHttp, or Moshi to demonstrate why this API will be simpler for users.

@swankjesse
Copy link
Collaborator Author

@yschimke I love your recommendation of using an object, not a primitive. It fixes many problems! We can make it clear when random access is not available – the cursor is null! And it creates a convenient place to add helper things like size.

Here’s a new API sketch that uses the name Cursor as the top level type. It’s the least-worst name I’ve found for this.

interface Source : Closeable {
  /** A cursor for the read position if this supports random access, or null if it does not. */
  val cursor: Cursor?
  ...
}
interface Cursor {
  /**
   * The position within the underlying data that will next be accessed. This is non-negative and
   * less than or equal to the size of the underlying data. When this equals [size] the stream is
   * exhausted and further bytes cannot be read.
   */
  @Throws(IOException::class)
  fun position(): Long

  /**
   * The total number of bytes in the underlying data. This will change as the underlying data
   * changes, such as if another process appends data to the end of a file.
   */
  @Throws(IOException::class)
  fun size(): Long

  /** Adjust the position within the underlying data. */
  @Throws(IOException::class)
  fun seek(position: Long)
}

The motivating use case for all of this is ZipFileSystem. To do that right we need random access twice:

  • Once to read the index at the end of the zip file. It contains a list of entries and their offsets.
  • Once for each entry that is read. We immediately seek it to read that file.

Here’s a quick, dirty sketch of how that might look.

@yschimke
Copy link
Collaborator

yschimke commented Feb 7, 2021

That is crazy simple. Just realised it doesn't read the index within that sample.

Also can you pull out a readonly filesystem base class? Can use it for a ClasspathFilesystem.

@yschimke
Copy link
Collaborator

yschimke commented Feb 8, 2021

@swankjesse looks good, I think there are still open questions about whether cursor is 1:1 or 1:n. Can some methods take a nullable Cursor as a parameter to become relative to the cursor instead of Source? Are there things to learn from the ByteBuffer API, like limits? So the Zip index telling you a range within the file for the cursor? Fine with no, but worth defining the meaning.

@swankjesse
Copy link
Collaborator Author

This turned out well for read-only, terribly for read-write. Ultimately we went with FileHandle instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants