Adding `position()` API to Reader (#654) #657

bingkh · 2020-12-02T18:19:47Z

Issue #, if available:
#654

Description of changes:
(Copied from issue description)
Hi,

Description
As a user of ion-js, I would love to be able to get position information from ionReader, so that when using ionReader to read an Ion file, I know which position I'm currently at.

Application scenario
Our team uses Ion to define a format in which, for example, paragraph P uses a format style S whose definition lives outside the paragraph itself. See below:

style::{
  name:S,
  font:Arial,
  .....
}
....
....
paragraph::{
  name:A,
  style:S,
  .....
}

I'm working on a VS Code extension, ion-style-peek, so that in the above Ion file, when a user viewing paragraph A invokes "Go to Definition" on S, the editor jumps to style S. A similar implementation can be found at https://github.com/pranaygp/vscode-css-peek

To implement this, we need to get the position while parsing a style in the Ion file, hence this feature request.

The position information is available through the following two paths:

Reader -> BinaryReader -> ParserBinaryRaw -> BinarySpan
Reader -> TextReader -> ParserTextRaw -> StringSpan

Thanks!

Ethan

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

zslayton

Hi, thanks for the PR! Some minor questions/comments below.

src/IonParserBinaryRaw.ts

zslayton · 2020-12-02T19:23:46Z

src/IonReader.ts

+   * @returns a [[number]] type presenting the position of the character the reading is
+   * currently reading.
+   */
+  position(): number | null;


A few thoughts here:

What does it mean for a Reader's position to be null?

The number returned by position() needs to make sense for both text and binary Readers. The comment describes only the meaning for a text reader.

Unfortunately, the definition of a "character" is ambiguous in Unicode terms. It could refer to a code point, code unit, glyph or grapheme cluster (among other possibilities). I think our best option here is to refer to code units:

A code unit is the unit of storage of a part of an encoded code point. In UTF-8 this means 8-bits, in UTF-16 this means 16-bits. A single code unit may represent a full code point, or part of a code point. For example, the snowman glyph (☃) is a single code point but 3 UTF-8 code units, and 1 UTF-16 code unit.
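To make the snowman example concrete, here is a minimal TypeScript sketch that counts code points, UTF-16 code units, and UTF-8 code units for a string. The helper names `countCodePoints` and `countUtf8CodeUnits` are made up for illustration and are not part of ion-js.

```typescript
// Hypothetical helpers for illustration only; not part of ion-js.

// True when `high` and `low` form a UTF-16 surrogate pair.
function isSurrogatePair(high: number, low: number): boolean {
  return high >= 0xd800 && high <= 0xdbff && low >= 0xdc00 && low <= 0xdfff;
}

// Number of Unicode code points in a JavaScript (UTF-16) string.
function countCodePoints(input: string): number {
  let count = 0;
  for (let i = 0; i < input.length; i++) {
    const next = i + 1 < input.length ? input.charCodeAt(i + 1) : 0;
    // A surrogate pair is two code units but a single code point.
    if (isSurrogatePair(input.charCodeAt(i), next)) {
      i++;
    }
    count++;
  }
  return count;
}

// Number of UTF-8 code units (bytes) needed to encode the same string.
function countUtf8CodeUnits(input: string): number {
  let count = 0;
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    const next = i + 1 < input.length ? input.charCodeAt(i + 1) : 0;
    if (unit < 0x80) {
      count += 1;
    } else if (unit < 0x800) {
      count += 2;
    } else if (isSurrogatePair(unit, next)) {
      count += 4; // code points above U+FFFF take four bytes
      i++;
    } else {
      count += 3;
    }
  }
  return count;
}

const snowman = "\u2603"; // ☃
const grinning = "\uD83D\uDE00"; // 😀 (U+1F600)

// `length` on a JavaScript string is the number of UTF-16 code units.
console.log(snowman.length, countCodePoints(snowman), countUtf8CodeUnits(snowman)); // 1 1 3
console.log(grinning.length, countCodePoints(grinning), countUtf8CodeUnits(grinning)); // 2 1 4
```

The second string shows why "character" is ambiguous: one code point, two UTF-16 code units, four UTF-8 code units.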

Thanks for the input!

Regarding the default value: I took a second look, and once the TextReader/BinaryReader is initialized, the position is initialized to 0 in StringSpan/BinarySpan. So returning null doesn't actually make sense.

I've updated the comment. How about the following:

Suggested change

-  position(): number | null;
+  /**
+   * Return the position of the current reader.
+   * The position refers to the distance between the code units where the reader
+   * started (e.g. the first code unit of the file), and the current code unit the
+   * reader is reading.
+   *
+   * A code unit is the unit of storage of a part of an encoded code point.
+   * Ref: https://stackoverflow.com/a/27331885/109549
+   *
+   * @returns a [[number]] type representing the position of the code unit the reader is
+   * currently reading.
+   */
+  position(): number;

The doc comment is an improvement, but we still need a couple of tweaks.

Because the Reader interface is used for both text and binary, we need to explain what position() means when the Reader is a binary reader. Namely, that it will be the byte offset from the start of the input rather than the code unit offset.

Because code units have different sizes in different Unicode encodings (a UTF-8 code unit is 1 byte, and a code point takes 1-4 of them; a UTF-16 code unit is 2 bytes), we should specify that this method returns the number of UTF-16 code units, the encoding used by JavaScript strings.

Even though it's a good explanation, I'd like to avoid using a StackOverflow link in the public documentation. We can point to JavaScript's String length documentation, however. We could also refer users to section 2.5 of the Unicode Standard, "Encoding Forms", but that's a bit dense for casual reference.

We should warn users that Readers cannot safely skip to a given position in the stream and begin reading as they may be skipping over system values like symbol definitions.

What do you think of:

/**
 * Returns the Reader's offset from the beginning of its input.
 *
 * For binary Readers, the return value is the number of bytes that have
 * been processed.
 *
 * For text Readers, the return value is the number of UTF-16 code units
 * that have been processed, regardless of the input's original encoding.
 * For more on JavaScript's in-memory representation of text, see:
 * https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length#Description
 *
 * Note that a Reader cannot safely skip to a given position in input without
 * processing the stream leading up to that position. This is because there are
 * mid-stream system level values that must be processed to guarantee that the
 * Reader is in a valid state. It is safe, however, to start at the beginning of a data
 * source and call next() until you reach the desired position, as the reader
 * will still have the opportunity to process system-level values along the way.
 *
 * @returns the [[number]] of bytes or UTF-16 code units that the reader has processed.
 */
position(): number;

?
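The "start at the beginning and call next() until you reach the desired position" pattern can be sketched as follows. This is a toy model under stated assumptions: `PositionedReader`, `ToyTextReader`, and `advanceTo` are hypothetical names invented for this example, and the toy reader only splits on whitespace. It is not the ion-js Reader and does not process symbol tables or other system values.

```typescript
// Hypothetical stand-in for a reader; NOT the ion-js Reader interface.
interface PositionedReader {
  next(): string | null; // the next token, or null at end of input
  position(): number; // UTF-16 code units processed so far
}

function isSpace(ch: string): boolean {
  return ch === " " || ch === "\n" || ch === "\t" || ch === "\r";
}

// Toy reader over whitespace-separated tokens.
class ToyTextReader implements PositionedReader {
  private input: string;
  private offset = 0;

  constructor(input: string) {
    this.input = input;
  }

  next(): string | null {
    // Skip leading whitespace before the next token.
    while (this.offset < this.input.length && isSpace(this.input.charAt(this.offset))) {
      this.offset++;
    }
    if (this.offset >= this.input.length) {
      return null;
    }
    const start = this.offset;
    while (this.offset < this.input.length && !isSpace(this.input.charAt(this.offset))) {
      this.offset++;
    }
    return this.input.slice(start, this.offset);
  }

  position(): number {
    return this.offset;
  }
}

// Steps forward one value at a time until the reader has processed at least
// `target` code units, so nothing in between is skipped unprocessed.
function advanceTo(reader: PositionedReader, target: number): string | null {
  let current: string | null = null;
  while (reader.position() < target) {
    current = reader.next();
    if (current === null) {
      break; // reached end of input before the target offset
    }
  }
  return current;
}

const toyReader = new ToyTextReader("style S paragraph A");
console.log(toyReader.position()); // 0
console.log(advanceTo(toyReader, 8), toyReader.position()); // paragraph 17
```

The point of the design is that `advanceTo` never seeks: every value before the target passes through `next()`, which is where a real reader would get the chance to handle system-level values.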

Thanks for the detailed reply! This looks a lot better!
I didn't think that much when I started out, and referring to StackOverflow was really a miss 🤦🏻‍♂️
I will update the commit with the latest comment.

- Updated comment of position() API

zslayton · 2020-12-09T18:37:40Z

Ok, looking good!

One other thing that I should've mentioned before (sorry! 😞): we need a couple of unit tests to prove that this works like we think it does.

Could you add some tests to test/IonBinaryReader.ts and test/IonTextReader.ts demonstrating the expected behavior? Some ideas:

Calling position() on new text readers returns 0
Calling position() on new binary readers returns 0 (or 4 -- they might eagerly consume the Ion Version Marker?)
Both formats: calling position() after calling next() returns the expected value for each of a few values in a stream
A text stream with some multi-byte and multi-code-unit characters returns the expected position in UTF-16 code units.
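For the last idea, one way to work out candidate expected values is to compute where each value ends in the source string, since JavaScript string indices are already UTF-16 code unit offsets. The sketch below is plain string arithmetic with a hypothetical helper `endOffsetOf`; it does not call ion-js, and whether a real reader reports exactly these offsets after `next()` (for example, if it reads ahead) is an assumption the actual tests would need to confirm.

```typescript
// Hypothetical helper for illustration only; not part of ion-js.
// Returns the offset, in UTF-16 code units, just past the first occurrence
// of `token` in `source`.
function endOffsetOf(source: string, token: string): number {
  const start = source.indexOf(token);
  if (start < 0) {
    throw new Error("token not found: " + token);
  }
  return start + token.length;
}

// Text: "café" "😀" 42
// "é" is one UTF-16 code unit (two UTF-8 bytes);
// "😀" is two UTF-16 code units (four UTF-8 bytes).
const sample = '"caf\u00E9" "\uD83D\uDE00" 42';

console.log(sample.length); // 14
console.log(endOffsetOf(sample, '"caf\u00E9"')); // 6
console.log(endOffsetOf(sample, '"\uD83D\uDE00"')); // 11
console.log(endOffsetOf(sample, "42")); // 14
```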

bingkh · 2020-12-12T16:48:47Z

Tests added!
No worries, I should've added the tests from the very beginning. Tests are also a good demonstration of the reader APIs.
Please let me know if anything needs updating!

zslayton

Looking good! One last round of small cleanups.

test/IonTextReader.ts

test/IonBinaryReader.ts

test/IonTextReader.ts

test/IonBinaryReader.ts

Co-authored-by: Zack Slayton <[email protected]>

bingkh · 2020-12-15T06:49:41Z

Picked all suggestions! 👍🏼

zslayton

Looks good, thanks for contributing!

bingkh · 2021-04-21T07:50:01Z

Hi team, can we know when this change can be released?

Thanks!

Adding postion() API to Reader (#654)

3cca04f

zslayton requested changes Dec 2, 2020

Adding postion() API to Reader (#654)

7139ed7

- Updated comment of position() API

bingkh added 2 commits December 13, 2020 00:21

Added test for position() ionReader api

76b079a

Minor fix in the test file

a0c3ec7

Minor fix to the tests: moved string definition to comments

0208429

zslayton reviewed Dec 15, 2020

Minor fix in the tests for position() API, taking from suggestion

6d1d7b2

Co-authored-by: Zack Slayton <[email protected]>

bingkh and others added 2 commits December 16, 2020 00:08

Minor addition to the tests

040c9e5

Merge branch 'master' into reader-position

d22ae1e

zslayton approved these changes Dec 21, 2020

zslayton merged commit 7d4bbc6 into amazon-ion:master Dec 21, 2020

zslayton mentioned this pull request Dec 21, 2020

Add getPosition() function to Reader #654

Closed

bingkh deleted the reader-position branch December 22, 2020 07:53

popematt mentioned this pull request Oct 6, 2021

Adds missing Reader method impls in _HashReaderImpl amazon-ion/ion-hash-js#57

Merged

bingkh commented Dec 2, 2020

zslayton left a comment

zslayton Dec 2, 2020

bingkh Dec 5, 2020

zslayton Dec 9, 2020 (edited)

bingkh Dec 9, 2020

zslayton commented Dec 9, 2020

bingkh commented Dec 12, 2020

zslayton left a comment

bingkh commented Dec 15, 2020

zslayton left a comment

bingkh commented Apr 21, 2021
