Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialize the newline_list to avoid recomputing it again later #2381

Merged
merged 4 commits into from
Feb 14, 2024

Conversation

eregon
Copy link
Member

@eregon eregon commented Feb 8, 2024

@eregon eregon requested a review from kddnewton February 8, 2024 16:51
@eregon eregon force-pushed the serialize-newline-list branch 4 times, most recently from b88ef1d to b836f03 Compare February 8, 2024 18:14
@eregon
Copy link
Member Author

eregon commented Feb 8, 2024

Should be green and ready for review now.

Copy link
Collaborator

@enebo enebo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something bothers me that we are putting Source into ParseResult. I feel like we should consider only putting LineOffsets into it. Our end goal is to not have any source by the time we get our serialized source back. We just have not made it to that point yet.

If this is true then Source is nothing more than startLine and source bytes[]. Do we actually need startLine or could this be boiled into the linecalcs as well? I did not read that but I wonder if we ever need that value after parsing is over. If so then Source can go away as well.

@enebo
Copy link
Collaborator

enebo commented Feb 8, 2024

@eregon Also simple error in build-java failing in CI but you obviously will fix that.

@eregon
Copy link
Member Author

eregon commented Feb 8, 2024

@enebo I think it's a lot cleaner and less error-prone this way, and it's also exactly how it works in the Ruby API (how Prism.parse takes a String of code and returns a ParseResult which includes the Prism::Source).
The main concern there is that if someone creates a Source object they would end up with invalid lineOffsets, unless the same object is also passed to Loader.
If the Source object is created by Loader it's not possible to have this issue.
Also creating the Source object in the caller felt awkward, I'd rather Loader takes just a (byte[] serialize, byte[] sourceBytes).

Do we actually need startLine or could this be boiled into the linecalcs as well?

It cannot be stored in the lineOffsets (which are offsets in the byte[]/char* source, so 0-based), it needs to be a separate field.

Regarding once we don't need the source bytes anymore for Loader, I think that's fine, then one would be able to just pass null for the sourceBytes (or have an overload doing that). And then the Source object would only contain startLine + lineOffsets, which seems all good (well except the name maybe, but its main purpose is to provide source line & column info).
In fact the only usage for bytes in the Source Java class currently is bytes.length, which we could do another way or avoid needing entirely in findLine().

It is also something this PR helps significantly, since the lineOffsets now come from the serialized form and there is no longer a need to use the sourceBytes to find them.

If it's problematic later for some unforeseen we can of course always rework this, it's pretty easy to change, b836f03 is a pretty small commit.

Are you OK with that?
If not I can remove that commit, but IMO it is significantly cleaner this way and also more convenient to call Loader. I was confused for a while how startLine was set before (since the value passed to the constructor was basically ignored), if the Loader creates the Source there is no hidden magic and it's obvious.

@eregon eregon force-pushed the serialize-newline-list branch from b836f03 to 1fac105 Compare February 8, 2024 20:48
@eregon eregon requested a review from enebo February 8, 2024 20:49
@enebo
Copy link
Collaborator

enebo commented Feb 8, 2024

@eregon yeah I forgot we are just getting an index as well so I did not think through the startLine comment.

Will you be keeping the source? I don't see anyone doing this with location fields removed but for FFI consumers who have all the location data they probably do want the source (although they do already have it).

I am ok though with landing this. We can argue out how we deal with source bytes once we actually don't need it for anything. I am isolating this code outside the JRuby codebase so these APIs can change as much as they want and it won't require a new JRuby release. Just a new gem release containing a newly made provider.

@enebo
Copy link
Collaborator

enebo commented Feb 8, 2024

@eregon I am only approving Java changes as I did not look very closely at the rest.

@eregon
Copy link
Member Author

eregon commented Feb 9, 2024

Will you be keeping the source?

Until TruffleRuby can cache the serialized form on disk (TBD where/how, not for the 24.0 release) it needs to read the source as a byte[] anyway to pass it to parseAndSerialize().
(and of course currently the source bytes are needed for Loader so it's not yet possible to skip reading the source file)

I am isolating this code outside the JRuby codebase so these APIs can change as much as they want and it won't require a new JRuby release. Just a new gem release containing a newly made provider.

Don't you vendor a specific version/commit of Prism in JRuby when using it as the parser for Ruby code?
That's what TruffleRuby and CRuby do. Then it doesn't matter which version(s) of Prism are installed as gems, it's independent.

Or do you also use the Java Loader when Prism is used as a gem? I wouldn't expect that since what's tested in CI on JRuby is the Ruby API using FFI (so not using the Java Loader).

@enebo
Copy link
Collaborator

enebo commented Feb 9, 2024

Don't you vendor a specific version/commit of Prism in JRuby when using it as the parser for Ruby code? That's what TruffleRuby and CRuby do. Then it doesn't matter which version(s) of Prism are installed as gems, it's independent.

Or do you also use the Java Loader when Prism is used as a gem? I wouldn't expect that since what's tested in CI on JRuby is the Ruby API using FFI (so not using the Java Loader).

I am isolating the Java part as an artifact and it is included into a Ruby gem which is effectively a "barely" fork of prism gem that has been renamed. That gem includes the artifact and will compile the source with the ENVs we want when the gem is installed (later we will probably pre-compile to remove compile requirement on main platforms).

The rationale for doing this is if there is a CVE or just a good fix there is a high likelihood I only have to do a gem release and not a full JRuby release.

@eregon
Copy link
Member Author

eregon commented Feb 12, 2024

I see, makes sense. Could be worth documenting at https://github.com/ruby/prism/blob/main/docs/build_system.md#building-prism-as-part-of-jruby
We cannot do this for TruffleRuby because it needs to work for --native mode and that needs to know all Java code ahead-of-time.

@eregon
Copy link
Member Author

eregon commented Feb 12, 2024

@kddnewton Could you review & merge this?

Copy link
Collaborator

@kddnewton kddnewton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally fine with the changes, just a couple of things I'd like to see changed.

templates/src/serialize.c.erb Outdated Show resolved Hide resolved
templates/ext/prism/api_node.c.erb Outdated Show resolved Hide resolved
templates/lib/prism/serialize.rb.erb Outdated Show resolved Hide resolved
lib/prism/parse_result.rb Outdated Show resolved Hide resolved
* That way there is no half-initialized Source object visible to the caller.
* Consistent with Prism::ParseResult in Ruby.
@eregon eregon force-pushed the serialize-newline-list branch 2 times, most recently from 2e6c772 to cace09f Compare February 13, 2024 16:49
* As the user should not set these.
* Use #instance_variable_set/rb_iv_set() instead internally.
@eregon
Copy link
Member Author

eregon commented Feb 13, 2024

@kddnewton I think I addressed all your comments, could you merge?
Notably I have some changes on top of this PR so I need this to be merged to open the PR with a meaningful diff.

…ension

* Faster that way:
  $ ruby -Ilib -rprism -rbenchmark/ips -e 'Benchmark.ips { |x| x.report("parse") { Prism.parse("1 + 2") } }'
  195.722k (± 0.5%) i/s
  rb_iv_set():
  179.609k (± 0.5%) i/s
  rb_funcall():
  190.030k (± 0.3%) i/s
  before this PR:
  183.319k (± 0.4%) i/s
@eregon eregon force-pushed the serialize-newline-list branch from b2c3222 to eaf7c2f Compare February 14, 2024 13:06
@eregon eregon requested a review from kddnewton February 14, 2024 13:19
@kddnewton kddnewton merged commit d00c1bb into ruby:main Feb 14, 2024
54 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Include the line offsets in the serialized output
3 participants