-
Notifications
You must be signed in to change notification settings - Fork 151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Copy & embed any string/constant pool entry in the serialized output #1537
Comments
I'm not sure about this. It makes it so that all serialization consumers now get a much bigger payload each time, even if they have the source available. This would make the JS one, for example, much bigger while it always has the source available. @enebo what do you think? |
There are two gains:
Also currently we need to hold onto the Source even after deserializing the entire Prism AST at once, for things like IntegerNode ( 😓 we should probably do something about that). That's rather suboptimal in memory usage, especially with lazy method compilation/translation (!= lazy deserialization, TruffleRuby already uses this with Prism), as then both the full AST and the source are kept for the entire program duration while maybe only a single/few methods of the file are used. |
There are two levels here to consider: loaded_tree and serialized_blob. For loaded tree the only source we depend on at this point is Integer, Float, and Rational. They need to be moved into the tree and not loaded from source. We just haven't made a decision on how to do this. Keeping source a round AND the tree is a huge memory cost over the legacy parser. I assume we 100% will be doing this soon. We are so close to not having source here. For serialization I think there is a pretty big carrot to moving constants into serialization: pre-compilation. It would be nice to eventually have a stable serialized format we can load and not bother to load from source at all (assuming it is trusted). We have plans to replace our AOT serialization format with this since Loader requires no C. As @eregon said we both do lazy method loading and whatever allows us to use as little memory as possible (also helps for startup time). Ultimately, what we would have for methods would be a slice of the constant pool and a slice of the offsets representing the method body (whether we actually slice this or keep whole serialized blob is up to experimentation). The time saved will be not loading huge portions of the tree until it is used. As some Rails commands I have see up to 80% of all method definitions loaded are not called. The last bit also mentioned by @eregon but he only mentions memory...it is faster to make this pool in C than have us create it in Java when Java is just starting up and still somewhat cold. If we have to depend on C we may as well have C do as much heavy lifting for us as we can get. @kddnewton I have said this many times now but there is a pretty large dual concern in prism and we keep trying to compromise to serve both. I think it is still working but there will always be some tradeoff so long as we use a single format. A larger serialized format is less memory for compilation but more memory for syntax consumers. A question would be how much larger would moving constant pool into serialization actually be? We will swap start+length for pool_offset + bytes in pool but for things like identifiers we will have quite a bit of sharing. So in those sorts of cases we may actually reduce size. Strings will be larger period. So, I do think it will be larger but I am unsure how much larger. |
Thanks @enebo for your thoughts here. I imagine it's okay to slow down the serialization API as it's only being used for Java and JS at the moment. The only real thing I'm concerned about here in terms of size would be strings. That could be quite a large serialized blob if the files have large strings in them. We would also need the DATA constant to be stored in the file as well, which could end up being large. If you all think it is worth it, I'm definitely game to try after we ship 3.3. |
@kddnewton yeah I am not sure what percentage of strings are within files but some files will end up much larger being pushed into serialization (although smaller than having both serialized+source). For me, this is a nice to have thing and something I definitely want but not at the cost of having prism being 100% usable. So deferring this is ok in my opinion. I would like the three numeric types to have values in the AST though. Carrying around source during build is some code I would love to delete and it will make a significant change in memory size (since we still do lazy method but just with AST+source). Removing source would get us near in memory to what we do with the legacy parser. |
@enebo I added the integer value storage to the next milestone so we'll do that. From reading I think I understand how to do that. However I am very unclear on how to do floats/rationals. I'm not tackling it today so it's not pressing, but if you have some resources I'm going to need them. |
@kddnewton an interim solution may just be for us to copy the number strings to the constant pool and then entertain the native storage later. |
True, although this would only be for the main file/script. I think Prism doesn't know about this currently, but it's a boolean we could pass in, and we'd only need Prism to serialize DATA in that case.
Yeah, this would be an easy way for now. OTOH that might be suboptimal for non-serializing use cases? I was thinking we might be able to just special-case them in Loader.java to store an extra byte[] for them (and declare that extra field in IntegerNode and FloatNode). Not the cleanest, but if the extra |
@eregon yeah I thought about us just adding stuff not in config.yml but it feels messy. If you do it though I won't complain. We just need to revisit this later when the real solutions occurs. |
I think we should copy & embed any string/constant pool entry in the serialized output and therefore not need the source at all when deserializing.
That would mean larger serialized size but no need to read the source at all which could be a nice speedup in the case the serialized form is saved on disk and newer/equal the source file.
I think the serialized size wouldn't be worse than before + eager loading constant pool and strings, it would be about the same, but gaining to not have to read the source.
That might actually be fairly easy, and would be less temporal memory than serialized + source at the same time in memory (but that's not a big deal).
This would also remove the need for accessing
owned
when serializing the constant pool: #1449 (comment)Originally posted by @eregon in #1479 (comment)
The text was updated successfully, but these errors were encountered: