Serialize the newline_list to avoid recomputing it again later #2381

eregon · 2024-02-08T16:51:52Z

Fixes Include the line offsets in the serialized output #2380

Part of #2402

eregon · 2024-02-08T18:15:05Z

Should be green and ready for review now.

enebo

Something bothers me that we are putting Source into ParseResult. I feel like we should consider only putting LineOffsets into it. Our end goal is to not have any source by the time we get our serialized source back. We just have not made it to that point yet.

If this is true then Source is nothing more than startLine and source bytes[]. Do we actually need startLine or could this be boiled into the linecalcs as well? I did not read that but I wonder if we ever need that value after parsing is over. If so then Source can go away as well.

enebo · 2024-02-08T18:41:30Z

@eregon Also simple error in build-java failing in CI but you obviously will fix that.

eregon · 2024-02-08T20:44:40Z

@enebo I think it's a lot cleaner and less error-prone this way, and it's also exactly how it works in the Ruby API (how Prism.parse takes a String of code and returns a ParseResult which includes the Prism::Source).
The main concern there is that if someone creates a Source object they would end up with invalid lineOffsets, unless the same object is also passed to Loader.
If the Source object is created by Loader it's not possible to have this issue.
Also creating the Source object in the caller felt awkward, I'd rather Loader takes just a (byte[] serialize, byte[] sourceBytes).

Do we actually need startLine or could this be boiled into the linecalcs as well?

It cannot be stored in the lineOffsets (which are offsets in the byte[]/char* source, so 0-based), it needs to be a separate field.

Regarding once we don't need the source bytes anymore for Loader, I think that's fine, then one would be able to just pass null for the sourceBytes (or have an overload doing that). And then the Source object would only contain startLine + lineOffsets, which seems all good (well except the name maybe, but its main purpose is to provide source line & column info).
In fact the only usage for bytes in the Source Java class currently is bytes.length, which we could do another way or avoid needing entirely in findLine().

It is also something this PR helps significantly, since the lineOffsets now come from the serialized form and there is no longer a need to use the sourceBytes to find them.

If it's problematic later for some unforeseen we can of course always rework this, it's pretty easy to change, b836f03 is a pretty small commit.

Are you OK with that?
If not I can remove that commit, but IMO it is significantly cleaner this way and also more convenient to call Loader. I was confused for a while how startLine was set before (since the value passed to the constructor was basically ignored), if the Loader creates the Source there is no hidden magic and it's obvious.

enebo · 2024-02-08T21:40:26Z

@eregon yeah I forgot we are just getting an index as well so I did not think through the startLine comment.

Will you be keeping the source? I don't see anyone doing this with location fields removed but for FFI consumers who have all the location data they probably do want the source (although they do already have it).

I am ok though with landing this. We can argue out how we deal with source bytes once we actually don't need it for anything. I am isolating this code outside the JRuby codebase so these APIs can change as much as they want and it won't require a new JRuby release. Just a new gem release containing a newly made provider.

enebo · 2024-02-08T21:41:30Z

@eregon I am only approving Java changes as I did not look very closely at the rest.

eregon · 2024-02-09T10:10:00Z

Will you be keeping the source?

Until TruffleRuby can cache the serialized form on disk (TBD where/how, not for the 24.0 release) it needs to read the source as a byte[] anyway to pass it to parseAndSerialize().
(and of course currently the source bytes are needed for Loader so it's not yet possible to skip reading the source file)

I am isolating this code outside the JRuby codebase so these APIs can change as much as they want and it won't require a new JRuby release. Just a new gem release containing a newly made provider.

Don't you vendor a specific version/commit of Prism in JRuby when using it as the parser for Ruby code?
That's what TruffleRuby and CRuby do. Then it doesn't matter which version(s) of Prism are installed as gems, it's independent.

Or do you also use the Java Loader when Prism is used as a gem? I wouldn't expect that since what's tested in CI on JRuby is the Ruby API using FFI (so not using the Java Loader).

enebo · 2024-02-09T15:35:17Z

Don't you vendor a specific version/commit of Prism in JRuby when using it as the parser for Ruby code? That's what TruffleRuby and CRuby do. Then it doesn't matter which version(s) of Prism are installed as gems, it's independent.

Or do you also use the Java Loader when Prism is used as a gem? I wouldn't expect that since what's tested in CI on JRuby is the Ruby API using FFI (so not using the Java Loader).

I am isolating the Java part as an artifact and it is included into a Ruby gem which is effectively a "barely" fork of prism gem that has been renamed. That gem includes the artifact and will compile the source with the ENVs we want when the gem is installed (later we will probably pre-compile to remove compile requirement on main platforms).

The rationale for doing this is if there is a CVE or just a good fix there is a high likelihood I only have to do a gem release and not a full JRuby release.

eregon · 2024-02-12T08:53:19Z

I see, makes sense. Could be worth documenting at https://github.com/ruby/prism/blob/main/docs/build_system.md#building-prism-as-part-of-jruby
We cannot do this for TruffleRuby because it needs to work for --native mode and that needs to know all Java code ahead-of-time.

eregon · 2024-02-12T08:53:32Z

@kddnewton Could you review & merge this?

kddnewton

Generally fine with the changes, just a couple of things I'd like to see changed.

templates/src/serialize.c.erb

templates/ext/prism/api_node.c.erb

templates/lib/prism/serialize.rb.erb

lib/prism/parse_result.rb

* Fixes ruby#2380

* That way there is no half-initialized Source object visible to the caller. * Consistent with Prism::ParseResult in Ruby.

* As the user should not set these. * Use #instance_variable_set/rb_iv_set() instead internally.

eregon · 2024-02-13T17:19:12Z

@kddnewton I think I addressed all your comments, could you merge?
Notably I have some changes on top of this PR so I need this to be merged to open the PR with a meaningful diff.

…ension * Faster that way: $ ruby -Ilib -rprism -rbenchmark/ips -e 'Benchmark.ips { |x| x.report("parse") { Prism.parse("1 + 2") } }' 195.722k (± 0.5%) i/s rb_iv_set(): 179.609k (± 0.5%) i/s rb_funcall(): 190.030k (± 0.3%) i/s before this PR: 183.319k (± 0.4%) i/s

eregon requested a review from kddnewton February 8, 2024 16:51

eregon force-pushed the serialize-newline-list branch 4 times, most recently from b88ef1d to b836f03 Compare February 8, 2024 18:14

enebo requested changes Feb 8, 2024

View reviewed changes

eregon force-pushed the serialize-newline-list branch from b836f03 to 1fac105 Compare February 8, 2024 20:48

eregon requested a review from enebo February 8, 2024 20:49

enebo approved these changes Feb 8, 2024

View reviewed changes

kddnewton requested changes Feb 13, 2024

View reviewed changes

templates/src/serialize.c.erb Outdated Show resolved Hide resolved

templates/ext/prism/api_node.c.erb Outdated Show resolved Hide resolved

templates/lib/prism/serialize.rb.erb Outdated Show resolved Hide resolved

lib/prism/parse_result.rb Outdated Show resolved Hide resolved

eregon added 2 commits February 13, 2024 17:31

Serialize the newline_list to avoid recomputing it again later

4eaaa90

* Fixes ruby#2380

Pass the sourceBytes instead of Source to Java Loader

6e57db7

* That way there is no half-initialized Source object visible to the caller. * Consistent with Prism::ParseResult in Ruby.

eregon force-pushed the serialize-newline-list branch 2 times, most recently from 2e6c772 to cace09f Compare February 13, 2024 16:49

Remove attr_writer's for ParseResult#start_line and #offsets

cace09f

* As the user should not set these. * Use #instance_variable_set/rb_iv_set() instead internally.

This was referenced Feb 13, 2024

Optimize the FFI backend #2402

Closed

Prefer Array.new(n) {} to n.times.map {} #2403

Merged

eregon force-pushed the serialize-newline-list branch from 50924b0 to b2c3222 Compare February 14, 2024 13:00

eregon force-pushed the serialize-newline-list branch from b2c3222 to eaf7c2f Compare February 14, 2024 13:06

eregon requested a review from kddnewton February 14, 2024 13:19

kddnewton approved these changes Feb 14, 2024

View reviewed changes

kddnewton merged commit d00c1bb into ruby:main Feb 14, 2024
54 checks passed

eregon mentioned this pull request Feb 14, 2024

Slower than CRuby while parsing files through YARP/Prism gem oracle/truffleruby#3293

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Serialize the newline_list to avoid recomputing it again later #2381

Serialize the newline_list to avoid recomputing it again later #2381

eregon commented Feb 8, 2024 •

edited

Loading

eregon commented Feb 8, 2024

enebo left a comment

enebo commented Feb 8, 2024

eregon commented Feb 8, 2024 •

edited

Loading

enebo commented Feb 8, 2024

enebo commented Feb 8, 2024

eregon commented Feb 9, 2024

enebo commented Feb 9, 2024

eregon commented Feb 12, 2024

eregon commented Feb 12, 2024 •

edited

Loading

kddnewton left a comment

eregon commented Feb 13, 2024 •

edited

Loading

Serialize the newline_list to avoid recomputing it again later #2381

Serialize the newline_list to avoid recomputing it again later #2381

Conversation

eregon commented Feb 8, 2024 • edited Loading

eregon commented Feb 8, 2024

enebo left a comment

Choose a reason for hiding this comment

enebo commented Feb 8, 2024

eregon commented Feb 8, 2024 • edited Loading

enebo commented Feb 8, 2024

enebo commented Feb 8, 2024

eregon commented Feb 9, 2024

enebo commented Feb 9, 2024

eregon commented Feb 12, 2024

eregon commented Feb 12, 2024 • edited Loading

kddnewton left a comment

Choose a reason for hiding this comment

eregon commented Feb 13, 2024 • edited Loading

eregon commented Feb 8, 2024 •

edited

Loading

eregon commented Feb 8, 2024 •

edited

Loading

eregon commented Feb 12, 2024 •

edited

Loading

eregon commented Feb 13, 2024 •

edited

Loading