Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFM: mark/reset for IOStream, IOBuffer, & AsyncStream (addresses #2638) #3656

Merged
merged 1 commit into from
Jun 30, 2014

Conversation

kmsquire
Copy link
Member

@kmsquire kmsquire commented Jul 9, 2013

Edit:

The current patch provides:

  1. mark(), reset(), unmark(), and ismarked() for IOStream and IOBuffer (and AsyncStream through IOBuffer)
  2. Simple testing for AsyncStreams and IOStream

These are loosely based on ideas from Java's BufferedReader, which allows a programmer to mark a location in an IO stream, and then seek back to that location later, even if the IO stream itself is not seekable:

m = mark(b::IO)  - marks a position in the wrapped IO object
reset(b::IO, m)  - resets b back to a marked position; the mark is removed
unmark(b::IO, m) - removes a mark
ismarked(b::IO)  - returns true if b is marked

See #2638 for further discussion.

@johnmyleswhite
Copy link
Member

Very cool!

return true
end
function compact(io::IOBuffer)
if !io.writable error("compact failed") end
if io.seekable error("compact failed") end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test was to ensure data integrity and enforce the distinction between a file-like object and a stream-like object. i don't think it should be removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll put this back. I thought I needed this before I added mark() and company to AsyncStreams, but I'm pretty sure it isn't necessary now.

@vtjnash
Copy link
Member

vtjnash commented Jul 9, 2013

+1

@StefanKarpinski
Copy link
Member

This is very cool functionality. Let's let this get reviewed as an API before we merge it though – I have a nagging feeling about this for some reason.

@JeffBezanson
Copy link
Member

I agree with Stefan, it is very cool, but something (possibly minor) is bothering me too. It could be that BufferedReader reminds me of java. I've also never needed a mark/seekmark API, but I appreciate how it is more general than peek/ungetc. Maybe it's that peek requires three operations (mark, read, seekmark).

BufferedReader seems like a nice abstraction, but I'm not sure when I'd use it. In practice, buffering needs deep integration with I/O and can't really be done in a separate layer. For example, AsyncStream has a callback for allocating buffer space, and then libuv fills the buffer behind the scenes. Maybe the equivalent of BufferedReader should be done by wrapping future I/O streams in the current ones, perhaps via a pipe.

@kmsquire
Copy link
Member Author

@JeffBezanson , I've been updating this, and didn't see your comments until after the updates (which I finished earlier and just pushed). I've modified the description at the top with what the current patch provides.

BufferedReader doesn't really play a role anymore. It was a bit unwieldy--wrapping other types nicely isn't one of Julia's strong suits. It does, however, maintain the mark, seekmark, and unmark mechanism, as well, as allow the following (based on Stefan's suggestion in #2638).

mark(io) do f
   a = chomp(readline(f))
   if a == "JL"
      seekmark(f)  #jump back to mark
      result = io_handler(f)
   end
   ...
end 
# exiting the do block resets io (calling seekmark(io))
b = chomp(readline(io))
# b == "JL" (if it did above)

It should actually be trivial to implement a multibyte peek based on this, or in a similar way.

@kmsquire
Copy link
Member Author

Didn't have a chance to finish responding earlier.

I generally agree with most of @JeffBezanson's comments above. My original motivation was simply to have a multibyte peek, which requires some sort of buffering, and there was some interest in this mechanism when I brought it up in #2638.

The main point of this interface, I think, is when parsing an input stream, and you need to read some of the stream in order to determine which parser to use. An example where this would be useful:

I use files that are compressed in bgzip format, which is a gzip variant with an extra section in the header giving the compressed block length, allowing indexing and random access into a bgzip file. It would be useful to read the gzip header to see if this extra header section was there, and then reset the stream and send it off to the correct decompressor. A similar argument could be made for auto-detecting other file types.

Removing BufferedReader would still leaves mark, unmark, and seekmark for IOStream, IOBuffer, and AsyncStream. Is there interest in having this mechanism in base?

( seekmark is reset in Java, which is starting to make more sense)

@JeffBezanson
Copy link
Member

I think I can agree that mark/seekmark is much better than peek or ungetc. The latter are clearly limited to single bytes (maybe characters?), and it's not so clear how to handle errors in peek.

@JeffBezanson
Copy link
Member

lockmark etc. aren't necessary. Can't mark just check that there is no existing mark? That would also make it easier to extend to multiple marks in the future (though not sure if that's a good idea).

@kmsquire
Copy link
Member Author

One step ahead of you (barely). (I did leave it WIP for a reason! :-)

The locks are necessary (or at least a good idea) if you want to use marking with the do syntax: you probably don't want someone to close the file or mess with the marks while inside the do block.

But, really, that syntax is just sugar, and it complicates things unnecessarily, so I just removed it. The current request is pretty minimal.

I'm going to squash, and actually change it to RFC--let me know if you have additional feedback.

@kmsquire
Copy link
Member Author

I hesitated to pollute the Base namespace with useful names, but given the typical use cases, I decided that I liked reset() better than seekmark(). I can change it back if that's desired.

@kmsquire
Copy link
Member Author

Should reset() remove the mark? For IOBuffer, forgetting to unmark() will cause the buffer to grow.

@kmsquire
Copy link
Member Author

Updates:

  • reset() now removes the mark--this seemed safest
  • mark() throws an error if the stream is already marked

@StefanKarpinski
Copy link
Member

What about this following scenario... You have code that sets a mark and then calls some other routines to do processing on the stream. The code that it calls also wants to set a mark and possibly rewind in order to do its thing. Now this will fail. I don't see why the callee or caller should care if the other sets a mark or not. It seems to me that the only way to make this work is to have a mark stack. Ideally, however, the API should not make it possible (or at least easy) to reset when you haven't marked or reset someone else's mark.

@kmsquire
Copy link
Member Author

The "do" version of mark would make that easier, by forcing the mark to be removed at the exit of the do-block. As I noted above, I think this is best accompanied by a locking mechanism, which could be made to allow code within the block (or called from with the block) to set new marks, but not release previous marks.

However, I think that's getting a little complicated, and should probably be part of a separate pull request.

@JeffBezanson
Copy link
Member

The API could be

m = mark(io)
...
reset(io, m)

where reset requires the marks to be released in stack order. This API can also be used now, just with the stack size limited to 1.

@kmsquire
Copy link
Member Author

Another possibility is to have mark give back a token, and require that token when resetting the current mark.

@kmsquire
Copy link
Member Author

Okay, that would work. It would have to be a token and not a location, at least for IOBuffer. The buffered data can be moved in the buffer (in compact), causing the location of all tokens to be moved.

@kmsquire
Copy link
Member Author

I finally got around to finishing this up. Marks are implemented as a stack, per @StefanKarpinski and @JeffBezanson's suggestion, and users are required to keep track of mark tokens, as in @JeffBezanson's example above. The patch includes docs and tests.

The tests pass on my system. Travis seems to have failed because of a missing ncurses dependency.

If there are no additional comments or suggestions, I think this is good to merge.

@kmsquire
Copy link
Member Author

Whoops, no, it was my fault. Forgot to add a file when I was cleaning things up. Fixed in a sec...

@kmsquire
Copy link
Member Author

Travis gcc build passed, clang failed... not clear to me why.

@vtjnash
Copy link
Member

vtjnash commented Aug 7, 2013

bump. lgtm. can this be merged now? (travis build error was unrelated)

@Keno
Copy link
Member

Keno commented Aug 7, 2013

Fine with me.

@kmsquire
Copy link
Member Author

kmsquire commented Aug 8, 2013

I'll rebase in the morning.

@kmsquire
Copy link
Member Author

kmsquire commented Aug 8, 2013

Rebased. It would be good to hear from @StefanKarpinski and/or @JeffBezanson before merging, as they have expressed the strongest opinions about previous versions (and the current implementation is based on their suggestions). Otherwise, this should be good to go.

@StefanKarpinski
Copy link
Member

As long as this allows you to stack marks, I defer to @JeffBezanson's opinion.

@kmsquire
Copy link
Member Author

kmsquire commented Aug 8, 2013

Marks are stacked, although unlike your original proposal, reseting a mark removes all later marks. I did this under the assumption that later marks would most likely have been added by called functions (or inner code blocks), which would have exited by the time reset is called. It also simplified the code if I assumed that later marks are always identical to or greater than earlier marks. These assumptions could be removed, at the cost of complicating the code some.

@vtjnash vtjnash added this to the 0.3 milestone Jun 16, 2014
@StefanKarpinski
Copy link
Member

Please stop adding changes to 0.3!

@kmsquire kmsquire mentioned this pull request Jun 16, 2014
@kmsquire
Copy link
Member Author

Rebased. Although I would love to see this merged (if only so I can stop wasting my time rebasing), I can also sympathize with @StefanKarpinski's desire not to muck with the v0.3 requirements.

So if this should go in, please merge it (when travis passes, which it should). If it should not go in, please remove the v0.3 tag, and we can merge as soon as v0.4 opens.

@StefanKarpinski
Copy link
Member

There's also a complementary functionality to consider: cork and uncork – corking an output stream would prevent things printed to it from being flushed, buffering output until an uncork. This can be useful for making sure that output is contiguous. Not sure if it makes sense to think about these together. There seems to be some relation.

@JeffBezanson
Copy link
Member

I want to merge this but I find it very hard to stomach allocating an extra array for every IOBuffer. That's a whole extra garbage array for every string you build.

@JeffBezanson
Copy link
Member

Isn't the ability to remove arbitrary marks (with splice!) dangerous? It silently shifts the meaning of all later marks. If the common case is that there are no later marks, why not only allow removing the last mark?

@vtjnash
Copy link
Member

vtjnash commented Jun 20, 2014

good point. would it be worth leaving this field uninitialized until you use it for the first time? or using tuples?

@vtjnash
Copy link
Member

vtjnash commented Jun 20, 2014

it removes all marks to the end, thereby invalidating all later marks (although it doesn't have a way to truly validate them later)

@JeffBezanson
Copy link
Member

Ah, of course. I find it strange to remove arbitrary marks; it's just hard to see when I would use that.

@kmsquire
Copy link
Member Author

I want to merge this but I find it very hard to stomach allocating an extra array for every IOBuffer. That's a whole extra garbage array for every string you build.

That's fair. There are also a lot of other things in IOBuffer which are not applicable to strings (although they don't cost as much). Why not make string building a separate class, and merge this after that? Not for 0.3, of course, as useful as this would be.

@vtjnash
Copy link
Member

vtjnash commented Jun 20, 2014

a few extra Bools and an Int allocated inline is much less than an Array object

@kmsquire
Copy link
Member Author

I find it strange to remove arbitrary marks; it's just hard to see when I would use that.

It's been a while since I implemented this, but if you read back through the comments above, the current implementation was loosely based on #3656 (comment).

Originally, there was only one mark--there's no reason it couldn't be changed back to that.

In theory, the current mechanism could be used for complicated or multilevel header parsing, as in multimedia files, or perhaps as input to backtracking parsers. But these are more obscure, and no one is clamoring for these features.

@kmsquire
Copy link
Member Author

a few extra Bools and an Int allocated inline is much less than an Array object

That's true. But at least if string building was a separate entity, the extra array allocation in this PR wouldn't affect string building.

Anyway, I just took out the array and replaced it with another inline Bool and Int (so only 1 mark is allowed). Will resubmit shortly.

@kmsquire
Copy link
Member Author

Okay, I reverted back to 1 mark per IO stream, maximum (and stashed the previous version in a different branch).

I started to rename the functions to mark!, unmark!, and reset!, but this really didn't jive with the rest of the IO commands (especially seek). So this is just like the previous PR, but no extra array in each IOBuffer, and only one mark per IO.

@JeffBezanson
Copy link
Member

I think I see now; it's a stack of marks but you can remove any number from the end for something like exception unwinding.

To me it's nice for I/O code to also be string building code. In fact it's hard to distinguish them, since we often write text-producing code using calls to print, and the same code is used for I/O and building strings. It would be unfortunate if there were 2 ways to do it, one of which is slightly faster.

Maybe this is a case where a linked list is appropriate. You wouldn't expect the mark stack to get very deep, and we could incrementally add one tiny object per mark.

@kmsquire
Copy link
Member Author

It could certainly be done with a linked-list. The behavior and logic would be different (and I think more complicated) than the array implementation.

At any rate, I won't have time to work on this soon. The current version allows just one mark, with just one extra field (an Int holding the marked position, negative if no mark).

I think this is still useful, and generic enough that it can be extended easily in the future.

@JeffBezanson, if you agree (and travis is green), take a quick look and merge if all is good.

If not, please remove the 0.3 tag, and we can return to this in 0.4.

(If anyone else wants to update this to use a linked list, feel free.)


.. function:: unmark(s)

Remove a mark from stream ``s``. Throws an error if ``s`` is not marked.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second sentence is wrong; looks like it can just be removed.

Also added show() for IOBuffer.

* Define mark, reset, unmark, ismarked for IO only.
  This is slightly sketchy, as it assumes that IO.mark exists, but
  is perhaps better than duplicating the code for all subtypes.
@kmsquire
Copy link
Member Author

Okay, I've redefined mark, unmark, reset, and ismarked for IO, to remove the duplication and in response to @JeffBezanson's comment to that effect. (As he pointed out, it is slightly sketchy, but it's possibly better than duplicating code.)

I also fixed the unmark docs.

I considered creating a simple interface (ala graphics) which required IO subtypes to implement
getmark and setmark!, and ismarked, and use these in the current functions. This would actually be cleaner, but wouldn't be much different than the previous interface which duplicated the code for mark, reset, etc., since these would still have to be implemented for each subtype. But it could be done.

JeffBezanson added a commit that referenced this pull request Jun 30, 2014
RFM: mark/reset for IOStream, IOBuffer, & AsyncStream (addresses #2638)
@JeffBezanson JeffBezanson merged commit 03294f8 into JuliaLang:master Jun 30, 2014
@kmsquire
Copy link
Member Author

kmsquire commented Jul 2, 2014

Thanks!

On Monday, June 30, 2014, Jeff Bezanson [email protected] wrote:

Merged #3656 #3656.


Reply to this email directly or view it on GitHub
#3656 (comment).

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 990f8f1 on kmsquire:buffered_reader into * on JuliaLang:master*.

@kmsquire kmsquire deleted the buffered_reader branch August 3, 2015 14:18
KristofferC pushed a commit that referenced this pull request Oct 17, 2023
Stdlib: Pkg
URL: https://github.com/JuliaLang/Pkg.jl.git
Stdlib branch: master
Julia branch: master
Old commit: b02fb9597
New commit: ffb6edf03
Julia version: 1.11.0-DEV
Pkg version: 1.11.0
Bump invoked by: @IanButterworth
Powered by:
[BumpStdlibs.jl](https://github.com/JuliaLang/BumpStdlibs.jl)

Diff:
JuliaLang/Pkg.jl@b02fb95...ffb6edf

```
$ git log --oneline b02fb9597..ffb6edf03
ffb6edf03 cache pidlock tweaks (#3654)
550eadd7e Pin registry for MetaGraph tests (#3666)
ee39026b8 Remove test that depends on Random being in the sysimg (#3656)
561508db2 CI: Increase the CI timeout. Update actions. Fix double precompilation. (#3665)
7c7ed63b1 Remove change UUID script it should be uncessary on Julia v1.11-dev (#3655)
a8648f7c8 Precompile: Fix algorithmic complexity of cycle detection (#3651)
0e0cf4514 Switch datastructure Vector -> Set for algorithmic complexity (#3652)
894cc3f78 respect if load-time precompile is disabled (#3648)
3ffd1cf73 Make auto GC message use printpkgstyle (#3633)
```

Co-authored-by: Dilum Aluthge <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io Involving the I/O subsystem: libuv, read, write, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants