RFM: mark/reset for IOStream, IOBuffer, & AsyncStream (addresses #2638) #3656

kmsquire · 2013-07-09T08:05:39Z

Edit:

The current patch provides:

mark(), reset(), unmark(), and ismarked() for IOStream and IOBuffer (and AsyncStream through IOBuffer)
Simple testing for AsyncStreams and IOStream

These are loosely based on ideas from Java's BufferedReader, which allows a programmer to mark a location in an IO stream, and then seek back to that location later, even if the IO stream itself is not seekable:

m = mark(b::IO)  - marks a position in the wrapped IO object
reset(b::IO, m)  - resets b back to a marked position; the mark is removed
unmark(b::IO, m) - removes a mark
ismarked(b::IO)  - returns true if b is marked

See #2638 for further discussion.

johnmyleswhite · 2013-07-09T14:54:11Z

Very cool!

vtjnash · 2013-07-09T15:29:29Z

base/iobuffer.jl

    return true
 end
 function compact(io::IOBuffer)
    if !io.writable error("compact failed") end
-    if io.seekable error("compact failed") end


this test was to ensure data integrity and enforce the distinction between a file-like object and a stream-like object. i don't think it should be removed.

Yeah, I'll put this back. I thought I needed this before I added mark() and company to AsyncStreams, but I'm pretty sure it isn't necessary now.

vtjnash · 2013-07-09T15:29:39Z

+1

StefanKarpinski · 2013-07-09T15:54:16Z

This is very cool functionality. Let's let this get reviewed as an API before we merge it though – I have a nagging feeling about this for some reason.

JeffBezanson · 2013-07-12T01:15:01Z

I agree with Stefan, it is very cool, but something (possibly minor) is bothering me too. It could be that BufferedReader reminds me of java. I've also never needed a mark/seekmark API, but I appreciate how it is more general than peek/ungetc. Maybe it's that peek requires three operations (mark, read, seekmark).

BufferedReader seems like a nice abstraction, but I'm not sure when I'd use it. In practice, buffering needs deep integration with I/O and can't really be done in a separate layer. For example, AsyncStream has a callback for allocating buffer space, and then libuv fills the buffer behind the scenes. Maybe the equivalent of BufferedReader should be done by wrapping future I/O streams in the current ones, perhaps via a pipe.

kmsquire · 2013-07-12T01:51:20Z

@JeffBezanson , I've been updating this, and didn't see your comments until after the updates (which I finished earlier and just pushed). I've modified the description at the top with what the current patch provides.

BufferedReader doesn't really play a role anymore. It was a bit unwieldy--wrapping other types nicely isn't one of Julia's strong suits. It does, however, maintain the mark, seekmark, and unmark mechanism, as well, as allow the following (based on Stefan's suggestion in #2638).

mark(io) do f
   a = chomp(readline(f))
   if a == "JL"
      seekmark(f)  #jump back to mark
      result = io_handler(f)
   end
   ...
end 
# exiting the do block resets io (calling seekmark(io))
b = chomp(readline(io))
# b == "JL" (if it did above)

It should actually be trivial to implement a multibyte peek based on this, or in a similar way.

kmsquire · 2013-07-12T06:34:31Z

Didn't have a chance to finish responding earlier.

I generally agree with most of @JeffBezanson's comments above. My original motivation was simply to have a multibyte peek, which requires some sort of buffering, and there was some interest in this mechanism when I brought it up in #2638.

The main point of this interface, I think, is when parsing an input stream, and you need to read some of the stream in order to determine which parser to use. An example where this would be useful:

I use files that are compressed in bgzip format, which is a gzip variant with an extra section in the header giving the compressed block length, allowing indexing and random access into a bgzip file. It would be useful to read the gzip header to see if this extra header section was there, and then reset the stream and send it off to the correct decompressor. A similar argument could be made for auto-detecting other file types.

Removing BufferedReader would still leaves mark, unmark, and seekmark for IOStream, IOBuffer, and AsyncStream. Is there interest in having this mechanism in base?

( seekmark is reset in Java, which is starting to make more sense)

JeffBezanson · 2013-07-12T07:05:23Z

I think I can agree that mark/seekmark is much better than peek or ungetc. The latter are clearly limited to single bytes (maybe characters?), and it's not so clear how to handle errors in peek.

JeffBezanson · 2013-07-12T19:10:00Z

lockmark etc. aren't necessary. Can't mark just check that there is no existing mark? That would also make it easier to extend to multiple marks in the future (though not sure if that's a good idea).

kmsquire · 2013-07-12T19:32:55Z

One step ahead of you (barely). (I did leave it WIP for a reason! :-)

The locks are necessary (or at least a good idea) if you want to use marking with the do syntax: you probably don't want someone to close the file or mess with the marks while inside the do block.

But, really, that syntax is just sugar, and it complicates things unnecessarily, so I just removed it. The current request is pretty minimal.

I'm going to squash, and actually change it to RFC--let me know if you have additional feedback.

kmsquire · 2013-07-12T19:42:39Z

I hesitated to pollute the Base namespace with useful names, but given the typical use cases, I decided that I liked reset() better than seekmark(). I can change it back if that's desired.

kmsquire · 2013-07-12T19:49:57Z

Should reset() remove the mark? For IOBuffer, forgetting to unmark() will cause the buffer to grow.

kmsquire · 2013-07-12T20:18:33Z

Updates:

reset() now removes the mark--this seemed safest
mark() throws an error if the stream is already marked

StefanKarpinski · 2013-07-12T21:34:45Z

What about this following scenario... You have code that sets a mark and then calls some other routines to do processing on the stream. The code that it calls also wants to set a mark and possibly rewind in order to do its thing. Now this will fail. I don't see why the callee or caller should care if the other sets a mark or not. It seems to me that the only way to make this work is to have a mark stack. Ideally, however, the API should not make it possible (or at least easy) to reset when you haven't marked or reset someone else's mark.

kmsquire · 2013-07-12T23:04:36Z

The "do" version of mark would make that easier, by forcing the mark to be removed at the exit of the do-block. As I noted above, I think this is best accompanied by a locking mechanism, which could be made to allow code within the block (or called from with the block) to set new marks, but not release previous marks.

However, I think that's getting a little complicated, and should probably be part of a separate pull request.

JeffBezanson · 2013-07-12T23:08:02Z

The API could be

m = mark(io)
...
reset(io, m)

where reset requires the marks to be released in stack order. This API can also be used now, just with the stack size limited to 1.

kmsquire · 2013-07-12T23:08:08Z

Another possibility is to have mark give back a token, and require that token when resetting the current mark.

kmsquire · 2013-07-12T23:12:01Z

Okay, that would work. It would have to be a token and not a location, at least for IOBuffer. The buffered data can be moved in the buffer (in compact), causing the location of all tokens to be moved.

kmsquire · 2013-07-20T16:07:17Z

I finally got around to finishing this up. Marks are implemented as a stack, per @StefanKarpinski and @JeffBezanson's suggestion, and users are required to keep track of mark tokens, as in @JeffBezanson's example above. The patch includes docs and tests.

The tests pass on my system. Travis seems to have failed because of a missing ncurses dependency.

If there are no additional comments or suggestions, I think this is good to merge.

kmsquire · 2013-07-20T16:11:49Z

Whoops, no, it was my fault. Forgot to add a file when I was cleaning things up. Fixed in a sec...

kmsquire · 2013-07-21T18:17:12Z

Travis gcc build passed, clang failed... not clear to me why.

vtjnash · 2013-08-07T19:45:19Z

bump. lgtm. can this be merged now? (travis build error was unrelated)

Keno · 2013-08-07T19:51:24Z

Fine with me.

kmsquire · 2013-08-08T05:14:36Z

I'll rebase in the morning.

kmsquire · 2013-08-08T19:24:21Z

Rebased. It would be good to hear from @StefanKarpinski and/or @JeffBezanson before merging, as they have expressed the strongest opinions about previous versions (and the current implementation is based on their suggestions). Otherwise, this should be good to go.

StefanKarpinski · 2013-08-08T19:34:48Z

As long as this allows you to stack marks, I defer to @JeffBezanson's opinion.

kmsquire · 2013-08-08T19:44:09Z

Marks are stacked, although unlike your original proposal, reseting a mark removes all later marks. I did this under the assumption that later marks would most likely have been added by called functions (or inner code blocks), which would have exited by the time reset is called. It also simplified the code if I assumed that later marks are always identical to or greater than earlier marks. These assumptions could be removed, at the cost of complicating the code some.

StefanKarpinski · 2014-06-16T18:17:44Z

Please stop adding changes to 0.3!

kmsquire · 2014-06-17T23:20:49Z

Rebased. Although I would love to see this merged (if only so I can stop wasting my time rebasing), I can also sympathize with @StefanKarpinski's desire not to muck with the v0.3 requirements.

So if this should go in, please merge it (when travis passes, which it should). If it should not go in, please remove the v0.3 tag, and we can merge as soon as v0.4 opens.

StefanKarpinski · 2014-06-18T00:21:51Z

There's also a complementary functionality to consider: cork and uncork – corking an output stream would prevent things printed to it from being flushed, buffering output until an uncork. This can be useful for making sure that output is contiguous. Not sure if it makes sense to think about these together. There seems to be some relation.

JeffBezanson · 2014-06-20T01:26:07Z

I want to merge this but I find it very hard to stomach allocating an extra array for every IOBuffer. That's a whole extra garbage array for every string you build.

JeffBezanson · 2014-06-20T01:37:05Z

Isn't the ability to remove arbitrary marks (with splice!) dangerous? It silently shifts the meaning of all later marks. If the common case is that there are no later marks, why not only allow removing the last mark?

vtjnash · 2014-06-20T01:38:50Z

good point. would it be worth leaving this field uninitialized until you use it for the first time? or using tuples?

vtjnash · 2014-06-20T02:00:15Z

it removes all marks to the end, thereby invalidating all later marks (although it doesn't have a way to truly validate them later)

JeffBezanson · 2014-06-20T02:37:01Z

Ah, of course. I find it strange to remove arbitrary marks; it's just hard to see when I would use that.

kmsquire · 2014-06-20T05:52:56Z

I want to merge this but I find it very hard to stomach allocating an extra array for every IOBuffer. That's a whole extra garbage array for every string you build.

That's fair. There are also a lot of other things in IOBuffer which are not applicable to strings (although they don't cost as much). Why not make string building a separate class, and merge this after that? Not for 0.3, of course, as useful as this would be.

vtjnash · 2014-06-20T05:56:54Z

a few extra Bools and an Int allocated inline is much less than an Array object

kmsquire · 2014-06-20T06:07:13Z

I find it strange to remove arbitrary marks; it's just hard to see when I would use that.

It's been a while since I implemented this, but if you read back through the comments above, the current implementation was loosely based on #3656 (comment).

Originally, there was only one mark--there's no reason it couldn't be changed back to that.

In theory, the current mechanism could be used for complicated or multilevel header parsing, as in multimedia files, or perhaps as input to backtracking parsers. But these are more obscure, and no one is clamoring for these features.

kmsquire · 2014-06-20T07:30:06Z

a few extra Bools and an Int allocated inline is much less than an Array object

That's true. But at least if string building was a separate entity, the extra array allocation in this PR wouldn't affect string building.

Anyway, I just took out the array and replaced it with another inline Bool and Int (so only 1 mark is allowed). Will resubmit shortly.

kmsquire · 2014-06-20T08:34:26Z

Okay, I reverted back to 1 mark per IO stream, maximum (and stashed the previous version in a different branch).

I started to rename the functions to mark!, unmark!, and reset!, but this really didn't jive with the rest of the IO commands (especially seek). So this is just like the previous PR, but no extra array in each IOBuffer, and only one mark per IO.

JeffBezanson · 2014-06-20T13:38:14Z

I think I see now; it's a stack of marks but you can remove any number from the end for something like exception unwinding.

To me it's nice for I/O code to also be string building code. In fact it's hard to distinguish them, since we often write text-producing code using calls to print, and the same code is used for I/O and building strings. It would be unfortunate if there were 2 ways to do it, one of which is slightly faster.

Maybe this is a case where a linked list is appropriate. You wouldn't expect the mark stack to get very deep, and we could incrementally add one tiny object per mark.

kmsquire · 2014-06-20T19:26:37Z

It could certainly be done with a linked-list. The behavior and logic would be different (and I think more complicated) than the array implementation.

At any rate, I won't have time to work on this soon. The current version allows just one mark, with just one extra field (an Int holding the marked position, negative if no mark).

I think this is still useful, and generic enough that it can be extended easily in the future.

@JeffBezanson, if you agree (and travis is green), take a quick look and merge if all is good.

If not, please remove the 0.3 tag, and we can return to this in 0.4.

(If anyone else wants to update this to use a linked list, feel free.)

JeffBezanson · 2014-06-22T04:58:29Z

doc/stdlib/base.rst

+
+.. function:: unmark(s)
+
+   Remove a mark from stream ``s``.  Throws an error if ``s`` is not marked.


The second sentence is wrong; looks like it can just be removed.

Also added show() for IOBuffer. * Define mark, reset, unmark, ismarked for IO only. This is slightly sketchy, as it assumes that IO.mark exists, but is perhaps better than duplicating the code for all subtypes.

kmsquire · 2014-06-25T04:27:42Z

Okay, I've redefined mark, unmark, reset, and ismarked for IO, to remove the duplication and in response to @JeffBezanson's comment to that effect. (As he pointed out, it is slightly sketchy, but it's possibly better than duplicating code.)

I also fixed the unmark docs.

I considered creating a simple interface (ala graphics) which required IO subtypes to implement
getmark and setmark!, and ismarked, and use these in the current functions. This would actually be cleaner, but wouldn't be much different than the previous interface which duplicated the code for mark, reset, etc., since these would still have to be implemented for each subtype. But it could be done.

RFM: mark/reset for IOStream, IOBuffer, & AsyncStream (addresses #2638)

kmsquire · 2014-07-02T04:47:47Z

Thanks!

On Monday, June 30, 2014, Jeff Bezanson [email protected] wrote:

Merged #3656 #3656.

—
Reply to this email directly or view it on GitHub
#3656 (comment).

coveralls · 2014-10-28T23:09:48Z

Changes Unknown when pulling 990f8f1 on kmsquire:buffered_reader into * on JuliaLang:master*.

@IanButterworth

Stdlib: Pkg URL: https://github.com/JuliaLang/Pkg.jl.git Stdlib branch: master Julia branch: master Old commit: b02fb9597 New commit: ffb6edf03 Julia version: 1.11.0-DEV Pkg version: 1.11.0 Bump invoked by: @IanButterworth Powered by: [BumpStdlibs.jl](https://github.com/JuliaLang/BumpStdlibs.jl) Diff: JuliaLang/Pkg.jl@b02fb95...ffb6edf ``` $ git log --oneline b02fb9597..ffb6edf03 ffb6edf03 cache pidlock tweaks (#3654) 550eadd7e Pin registry for MetaGraph tests (#3666) ee39026b8 Remove test that depends on Random being in the sysimg (#3656) 561508db2 CI: Increase the CI timeout. Update actions. Fix double precompilation. (#3665) 7c7ed63b1 Remove change UUID script it should be uncessary on Julia v1.11-dev (#3655) a8648f7c8 Precompile: Fix algorithmic complexity of cycle detection (#3651) 0e0cf4514 Switch datastructure Vector -> Set for algorithmic complexity (#3652) 894cc3f78 respect if load-time precompile is disabled (#3648) 3ffd1cf73 Make auto GC message use printpkgstyle (#3633) ``` Co-authored-by: Dilum Aluthge <[email protected]>

vtjnash reviewed Jul 9, 2013
View reviewed changes

vtjnash mentioned this pull request Jul 12, 2013

Undefined modules on remote results in a hanging remote call on the caller. #3680

Closed

This was referenced Jul 24, 2013

make byte order types IO wrappers #3820

Closed

refactor I/O to do more buffering #3887

Open

vtjnash added this to the 0.3 milestone Jun 16, 2014

kmsquire mentioned this pull request Jun 16, 2014

STDIN problems #6523

Closed

kmsquire mentioned this pull request Jun 18, 2014

On-demand loading of support for file formats? #7299

Closed

JeffBezanson reviewed Jun 22, 2014
View reviewed changes

Added mark/unmark/reset/ismarked to IOStream, IOBuffer, and AsyncStream

990f8f1

Also added show() for IOBuffer. * Define mark, reset, unmark, ismarked for IO only. This is slightly sketchy, as it assumes that IO.mark exists, but is perhaps better than duplicating the code for all subtypes.

JeffBezanson added a commit that referenced this pull request Jun 30, 2014

Merge pull request #3656 from kmsquire/buffered_reader

03294f8

RFM: mark/reset for IOStream, IOBuffer, & AsyncStream (addresses #2638)

JeffBezanson merged commit 03294f8 into JuliaLang:master Jun 30, 2014

kmsquire mentioned this pull request Jul 4, 2014

Feature Request: peek(io::IO, x) #2638

Closed

kmsquire deleted the buffered_reader branch August 3, 2015 14:18

tkelman referenced this pull request Dec 19, 2015

Use C types for ios_*() function calls

df5ec48


		.. function:: unmark(s)

		Remove a mark from stream ``s``. Throws an error if ``s`` is not marked.

RFM: mark/reset for IOStream, IOBuffer, & AsyncStream (addresses #2638) #3656

RFM: mark/reset for IOStream, IOBuffer, & AsyncStream (addresses #2638) #3656

Conversation

kmsquire commented Jul 9, 2013

johnmyleswhite commented Jul 9, 2013

vtjnash Jul 9, 2013

Choose a reason for hiding this comment

kmsquire Jul 9, 2013

Choose a reason for hiding this comment

vtjnash commented Jul 9, 2013

StefanKarpinski commented Jul 9, 2013

JeffBezanson commented Jul 12, 2013

kmsquire commented Jul 12, 2013

kmsquire commented Jul 12, 2013

JeffBezanson commented Jul 12, 2013

JeffBezanson commented Jul 12, 2013

kmsquire commented Jul 12, 2013

kmsquire commented Jul 12, 2013

kmsquire commented Jul 12, 2013

kmsquire commented Jul 12, 2013

StefanKarpinski commented Jul 12, 2013

kmsquire commented Jul 12, 2013

JeffBezanson commented Jul 12, 2013

kmsquire commented Jul 12, 2013

kmsquire commented Jul 12, 2013

kmsquire commented Jul 20, 2013

kmsquire commented Jul 20, 2013

kmsquire commented Jul 21, 2013

vtjnash commented Aug 7, 2013

Keno commented Aug 7, 2013

kmsquire commented Aug 8, 2013

kmsquire commented Aug 8, 2013

StefanKarpinski commented Aug 8, 2013

kmsquire commented Aug 8, 2013

StefanKarpinski commented Jun 16, 2014

kmsquire commented Jun 17, 2014

StefanKarpinski commented Jun 18, 2014

JeffBezanson commented Jun 20, 2014

JeffBezanson commented Jun 20, 2014

vtjnash commented Jun 20, 2014

vtjnash commented Jun 20, 2014

JeffBezanson commented Jun 20, 2014

kmsquire commented Jun 20, 2014

vtjnash commented Jun 20, 2014

kmsquire commented Jun 20, 2014

kmsquire commented Jun 20, 2014

kmsquire commented Jun 20, 2014

JeffBezanson commented Jun 20, 2014

kmsquire commented Jun 20, 2014

JeffBezanson Jun 22, 2014

Choose a reason for hiding this comment

kmsquire commented Jun 25, 2014

kmsquire commented Jul 2, 2014

coveralls commented Oct 28, 2014