Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding postion() API to Reader (#654) #657

Merged
merged 8 commits into from
Dec 21, 2020
Merged

Adding postion() API to Reader (#654) #657

merged 8 commits into from
Dec 21, 2020

Conversation

bingkh
Copy link
Contributor

@bingkh bingkh commented Dec 2, 2020

Issue #, if available:
#654

Description of changes:
(Copied from issue description)
Hi,

Description
As a user of ion-js, I would love to be able to get position information from ionReader, so that when using ionReader to read a ion file, I can know which position I'm currently at.

Application scenario
Our team uses ion to define a format, e.g. paragraph P uses certain format style S, and the definition of style S is out side of the paragraph itself. see blow:

style::{
  name:S,
  font:Arial,
  .....
}
....
....
paragraph::{
  name:A,
  style:S,
  .....
}

I'm working on a vs code extension: ion-style-peek, so that in the above ion file, when users are viewing paragraph A and call go to definition on S, the editor will jump to style S. Similar implementation can be found: https://github.com/pranaygp/vscode-css-peek

To implement the above, we need to get the position when parsing style in the ion, hence the requesting issue.

We can get position information from the below two paths:

Reader -> BinaryReader -> ParserBinaryRaw -> StringSpan
Reader -> TextReader -> ParserTextRaw -> BinarySpan

Thanks!

Ethan

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Copy link
Contributor

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for the PR! Some minor questions/comments below.

src/IonParserBinaryRaw.ts Show resolved Hide resolved
src/IonReader.ts Outdated
* @returns a [[number]] type presenting the position of the character the reading is
* currently reading.
*/
position(): number | null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few thoughts here:

  • What does it mean for a Reader's position to be null?

  • The number returned by position() needs to make sense for both text and binary Readers. The comment describes only the meaning for a text reader.

  • Unfortunately, the definition of a "character" is ambiguous in Unicode terms. It could refer to a code point, code unit, glyph or grapheme cluster (among other possibilities). I think our best option here is to refer to code units:

    A code unit is the unit of storage of a part of an encoded code point. In UTF-8 this means 8-bits, in UTF-16 this means 16-bits. A single code unit may represent a full code point, or part of a code point. For example, the snowman glyph (☃) is a single code point but 3 UTF-8 code units, and 1 UTF-16 code unit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the input!

  • for default value, I took a second look, and actually once the TextReader/BinaryReader is initialized, the position will be initialized as 0 in StringSpan/BinarySpan. So I guess the null actually doesn't make sense.
  • Comment updated, how about the below:
Suggested change
position(): number | null;
/**
* Return the position of the current reader.
* The position refers to the distance between the code units where the reader
* stared (e.g. the first code unit of the file), and the current code unit the
* reader is reading.
*
* A code unit is the unit of storage of a part of an encoded code point.
* Ref: https://stackoverflow.com/a/27331885/109549
*
* @returns a [[number]] type presenting the position of the code unit the reader is
* currently reading.
*/
position(): number;

Copy link
Contributor

@zslayton zslayton Dec 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc comment is an improvement, but we still need a couple of tweaks.

  • Because the Reader interface is used for both text and binary, we need to explain what position() means when the Reader is a binary reader. Namely, that it will be the byte offset from the start of the input rather than the code unit offset.
  • Because code units are different in different unicode encodings (UTF8 code units are 1-4 bytes, UTF16 are 2 bytes), we should specify that this method returns the number of UTF-16 code units, the encoding used by JavaScript strings.
  • Even though it's a good explanation, I'd like to avoid using a StackOverflow link in the public documentation. We can point to JavaScript's String length documentation, however. We could also refer users to chapter 2.5 of the Unicode Standard, "Encoding Forms", but that's a bit dense for casual reference.
  • We should warn users that Readers cannot safely skip to a given position in the stream and begin reading as they may be skipping over system values like symbol definitions.

What do you think of:

/**
   * Returns the Reader's offset from the beginning of its input.
   * 
   * For binary Readers, the return value is the number of bytes that have
   * been processed.
   * 
   * For text Readers, the return value is the number of UTF-16 code units
   * that have been processed, regardless of the input's original encoding.
   * For more on JavaScript's in-memory representation of text, see:
   * https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length#Description
   * 
   * Note that a Reader cannot safely skip to a given position in input without
   * processing the stream leading up to that position. This is because there are
   * mid-stream system level values that must be processed to guarantee that the 
   * Reader is in a valid state. It is safe, however, to start at the beginning of a data
   * source and call next() until you reach the desired position, as the reader
   * will still have the opportunity to process system-level values along the way.)
   *
   * @returns the [[number]] of bytes or UTF-16 code units that the reader has processed.
  position(): number;

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed reply! This looks a lot better!
I didn't think that much when I started out, and referring to StackOverflow was really a miss 🤦🏻‍♂️
I will update the commit with the latest comment.

- Updated comment of position() API
@zslayton
Copy link
Contributor

zslayton commented Dec 9, 2020

Ok, looking good!

One other thing that I should've mentioned before (sorry! 😞): we need a couple of unit tests to prove that this works like we think it does.

Could you add some tests to test/BinaryReader.ts and test/IonTextReader.ts demonstrating the expected behavior? Some ideas:

  • Calling position() on new text readers returns 0
  • Calling position() on new binary readers returns 0 (or 4 -- they might eagerly consume the Ion Version Marker?)
  • Both formats: calling position() after calling next() returns the expected value for each of a few values in a stream
  • A text stream with some multi-byte and multi-code-unit characters returns the expected position in UTF-16 code units.

@bingkh
Copy link
Contributor Author

bingkh commented Dec 12, 2020

Tests added!
No worry, I should've added the tests from the very beginning, tests are also a good demonstration of the reader APIs.
Please let me know if there needs to be any update!

Copy link
Contributor

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! One last round of small cleanups.

test/IonTextReader.ts Outdated Show resolved Hide resolved
test/IonBinaryReader.ts Outdated Show resolved Hide resolved
test/IonTextReader.ts Outdated Show resolved Hide resolved
test/IonBinaryReader.ts Outdated Show resolved Hide resolved
@bingkh
Copy link
Contributor Author

bingkh commented Dec 15, 2020

Picked all suggestions! 👍🏼

Copy link
Contributor

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for contributing!

@zslayton zslayton merged commit 7d4bbc6 into amazon-ion:master Dec 21, 2020
@bingkh bingkh deleted the reader-position branch December 22, 2020 07:53
@bingkh
Copy link
Contributor Author

bingkh commented Apr 21, 2021

Hi team, can we know when this change can be released?

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants