Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for CBOR stringref extension (CBORGenerator.Feature.STRINGREF) #347

Merged
merged 3 commits into from
Feb 4, 2023

Conversation

here-abarany
Copy link
Contributor

See http://cbor.schmorp.de/stringref for stringref definition. Repeated binary and text strings are assigned an index on write and a tagged integer is written when repeated. Parser respects the stringref IDs when reading data back in.

Nested tags are now supported. The previous getCurrentTag() function is still supported for backwards compatibility, and returns the first tag. Otherwise a new TagList type is used to efficiently handle multiple tags.

@cowtowncoder
Copy link
Member

cowtowncoder commented Dec 11, 2022

Ok, first of all: thank you for contributing this feature! It sounds a useful thing, CBOR implementing support something Smile has had for a while.

One thing I was wondering about: since this is a sizable thing, I was wondering if it was possible to split this in two PRs: one dealing with parsing, and second one (building on first) for generation? I realize it may be tricky wrt tests, but it would be easier to code review that way if at all possible.

Second thing: regardless of whether it's 1 or 2 PRs, one (and only) process thing we need is CLA.
It can be found from here:

https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf

and is only needed once before the first contribution is merged (one is good for all future PRs for all Jackson projects).
The usual way is to print it, fill & sign, scan/photo, email to info at fasterxml dot com.

Thank you once again, looking forward to reviewing this PR and getting it merged for 2.15

@here-abarany
Copy link
Contributor Author

One thing I was wondering about: since this is a sizable thing, I was wondering if it was possible to split this in two PRs: one dealing with parsing, and second one (building on first) for generation?

From what I understand this would require separate branches, which would be very unwieldy to manage and would require the first PR to be submitted before the second could actually be looked at. (since the second one would contain the code from the first) What I can do is split it into two commits, which would allow you to review them separately while being part of the same branch/PR.

Second thing: regardless of whether it's 1 or 2 PRs, one (and only) process thing we need is CLA.

Since I'm contributing this on company time I've forwarded this to the OSS department to make sure I get the OK to sign it. Hopefully it won't take too long to get approval.

@cowtowncoder
Copy link
Member

One thing I was wondering about: since this is a sizable thing, I was wondering if it was possible to split this in two PRs: one dealing with parsing, and second one (building on first) for generation?

From what I understand this would require separate branches, which would be very unwieldy to manage and would require the first PR to be submitted before the second could actually be looked at. (since the second one would contain the code from the first) What I can do is split it into two commits, which would allow you to review them separately while being part of the same branch/PR.

Second thing: regardless of whether it's 1 or 2 PRs, one (and only) process thing we need is CLA.

Since I'm contributing this on company time I've forwarded this to the OSS department to make sure I get the OK to sign it. Hopefully it won't take too long to get approval.

Yes, would require 2 branches and I guess that won't play nicely with PRs. Too bad but I guess it makes sense to keep things bundled.

And yes, CLA should be fine; there's CCLA available if your company prefers that.

@here-abarany
Copy link
Contributor Author

Just to give a heads up, we have a tentative agreement internally to sign the CCLA, but are trying to find the right people to sign it. I'm hoping we'll get it submitted by mid January. In the meantime I'll be out the next two weeks myself for the holidays.

@here-abarany
Copy link
Contributor Author

@cowtowncoder I finally managed to get a signature and forwarded the CCLA last Friday. I just wanted to make sure that it was received and we can continue with the PR.

@cowtowncoder
Copy link
Member

@here-abarany I can't seem to find CCLA. Was it sent to info at fasterxml dot com?

@here-abarany
Copy link
Contributor Author

@cowtowncoder it was sent to clas rather than info, which was the address in the document. I can forward it to info as well.

@cowtowncoder
Copy link
Member

@here-abarany Ok odd. I think clas should get forward to info... but yes, please try re-sending it as I don't see it.

@here-abarany
Copy link
Contributor Author

@cowtowncoder I just forwarded it now.

/**
* Stack of text and binary string references.
*/
protected Stack<StringRefList> _stringRefs = new Stack<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could probably create sub-class that has helper methods, if that simplifies calling code (will add a note where this'd help)

_streamReadContext = _streamReadContext.getParent();
return (_currToken = JsonToken.END_OBJECT);
}
return (_currToken = _decodePropertyName());
}
} else {
if (!_streamReadContext.expectMoreValues()) {
_tagValue = -1;
--_nestedDepth;
if (!_stringRefs.empty() && _stringRefs.peek().depth == _nestedDepth) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a simple sub-class of Stack would allow hiding this logic into single method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a helper class for this. I chose to use containment rather than subclass for this case as the interface is slightly different.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@here-abarany Ok yes, that is cleaner approach from OO design perspective. Although being only used internally not a big thing either way.

@@ -837,6 +978,9 @@ protected String _numberToName(int ch, boolean neg) throws IOException
// [dataformats-binary#269] (and earlier [dataformats-binary#30]),
// got some edge case to consider
if (i < 0) {
if (isStringref) {
_reportError("String reference index too large.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove trailing .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I got all the added errors, let me know if I missed any.

@cowtowncoder
Copy link
Member

Ok @here-abarany -- impressive changes! I had some questions etc, although nothing major.

My main concern is actually getting this to merge against 3.0 (master). I know it's not practical to split this into smaller chunks, but one thing that might help is if white-space changes (removal of white-space from end-of-lines) could be done as a separate PR first?
I could merge that in, then into master; that'd probably remove noise slightly.

I suspect I will need help with master merge anyway but maybe little bit less.

@here-abarany
Copy link
Contributor Author

@cowtowncoder thanks for the review, I hope to push up an update later today based on your comments. I'll have two separate pushes: one to rebase onto the latest 2.15 changes and one to push up the updates.

With regards to the whitespace changes, my editor is configured to clear trailing whitespace on save. I might be able to check out to the 2.15 head, save those files, and cherry-pick my changes on top of that to reduce some of the noise. If I'm successful I'll include that with my pushes later today.

I can try merging into the master branch locally and see what happens. Hopefully everything merges cleanly, if not I'll note what issues I encountered.

@cowtowncoder
Copy link
Member

@cowtowncoder thanks for the review, I hope to push up an update later today based on your comments. I'll have two separate pushes: one to rebase onto the latest 2.15 changes and one to push up the updates.

With regards to the whitespace changes, my editor is configured to clear trailing whitespace on save. I might be able to check out to the 2.15 head, save those files, and cherry-pick my changes on top of that to reduce some of the noise. If I'm successful I'll include that with my pushes later today.

I can try merging into the master branch locally and see what happens. Hopefully everything merges cleanly, if not I'll note what issues I encountered.

It's unlikely to merge cleanly, partly due to package renaming, partly due to big API changes.
But exactly how tricky it is can vary. :)

As to whitespace, I'll try to figure out how to do that on my end for 2.15; that way you can just rebase on 2.15 head. No need for you to have to do that now I think.

@cowtowncoder
Copy link
Member

@here-abarany Ok I did white-space trimming, merged it to master (re-linking some sources that hopefully reduces conflicts -- changing Java package causes lots of pain). So you may want to merge/rebase from 2.15 now.

@here-abarany here-abarany force-pushed the 2.15 branch 2 times, most recently from 1d5d919 to ff93f22 Compare February 3, 2023 02:17
@here-abarany
Copy link
Contributor Author

@cowtowncoder I have pushed up my changes, I believe I addressed all the comment threads from the review. The first push is just a rebase, so you should be able to view just my latest changes by comparing the latest push just before this comment.

@@ -236,6 +253,11 @@ public int getMask() {
*/
protected boolean _bufferRecyclable;

/**
* Table of previously referenced text and binary strings.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... Used if StringRef functionality is enabled. // or something like that

@since 2.15

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure if protected fields should be marked, I'll go through and try to add it for the fields I've added. (as well as the comment clarification noted here)


/**
* Shared string that should be used in place of _textBuffer when a string reference is used.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SInCE 2.15

@@ -47,6 +47,16 @@ public void testJsonSampleDoc() throws IOException
verifyJsonSpecSampleDoc(cborParser(data), false, true);
}

public void testJsonSampleDocStringref() throws IOException
{
byte[] data = cborDoc(new CBORFactory().enable(CBORGenerator.Feature.STRINGREF),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you change this to use Builder? This merges cleanly with 3.0 where factories are immutable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've switched this and the one below to use a builder.

See http://cbor.schmorp.de/stringref for stringref definition.
When a stringref namespace is encountered, indices will be generated for
each valid binary and text string and used when referenced.

Nested tags are now supported. The previous getCurrentTag() function is
still supported for backwards compatibility, and returns the first tag.
Otherwise a new TagList type is used to efficiently handle multiple tags.
Citm citm0 = MAPPER.readValue(getClass().getResourceAsStream("/data/citm_catalog.json"),
Citm.class);
ObjectMapper mapper = new CBORMapper(
new CBORFactory().enable(CBORGenerator.Feature.STRINGREF));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, builder over factory configuration (I know, many existing tests use this, trying to move over with new code).

@cowtowncoder
Copy link
Member

@here-abarany Ok, sorry, missed couple of smaller things; now done with reviewing I think. Minor test changes requested to make it more likely things merge less uncleanly.

When enabled with the STRINGREF feature, a namespace is created for the
root element and indices are produced for valid binary and text strings.
Copying tags can be dangerous in two main cases:
1. Jackson creates its own tags, such as for BigNum or the self describe
   tag.
2. The semantic meaning of an element changes. For example, converting
   between stringrefs and no stringrefs.

In some cases it would result in duplicate tags, while in others it may
cause completely incorrect tags, such as a stringref tag applying to a
string rather than an integer. Before nested tag support was added this
could generate files that Jackson could not parse, though even with nested
tag support it may cause undefined behavior.
@here-abarany
Copy link
Contributor Author

@cowtowncoder I have pushed up a new revision to fix your latest comments.

@cowtowncoder
Copy link
Member

Thank you @here-abarany! I will pick this up again today or tomorrow, get it merged.

@cowtowncoder cowtowncoder merged commit 93f7cc3 into FasterXML:2.15 Feb 4, 2023
@cowtowncoder
Copy link
Member

Ok, phew. Merge to master went ok, almost: there is just one new failure and that's for String-Refs; BiggerDataTest.testRoundTripStringref().

I suspect it is since CBORParser.nextFieldName() methods (renamed as just nextName() in 3.0) do not all handle String-refs.
Or rather, new nextNameMatch() method doesn't -- it probably needs to, and I think jackson-databind uses that for Bean properties in 3.0.

Could you have a look if you can see how to fix this, @here-abarany ?

@cowtowncoder cowtowncoder changed the title Add support for CBOR stringref extension. Add support for CBOR stringref extension (CBORGenerator.Feature.STRINGREF) Feb 4, 2023
@cowtowncoder cowtowncoder added this to the 2.15.0 milestone Feb 4, 2023
cowtowncoder added a commit that referenced this pull request Feb 4, 2023
@here-abarany
Copy link
Contributor Author

@cowtowncoder thanks for getting it merged in, I'll see if I can take a look on Monday for the master issue.

@cowtowncoder
Copy link
Member

@cowtowncoder thanks for getting it merged in, I'll see if I can take a look on Monday for the master issue.

Thank you in advance! Hopefully it is relatively straight-forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants