Store whole IR.Module in .bindings cache #8924

Merged: 42 commits, Feb 9, 2024

Conversation

@JaroslavTulach (Member) commented Feb 1, 2024

Pull Request Description

Fixes #8639 by storing the whole IR.Module inside the library's .bindings cache and using those modules instead of loading caches from individual .ir files. The --compile command is modified to not produce .ir caches - they are already contained in the .bindings files.

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • All code follows the Scala and Java style guides.
  • All code has been tested:
    • All tests continue to pass
    • Benchmarks are good: they show a 3.5% speed-up. One more benchmark run.
    • Use mmapped buffers only for "large files" like the .bindings of Table or Base - restricted in c061494

@JaroslavTulach JaroslavTulach added the CI: No changelog needed Do not require a changelog entry for this PR. label Feb 1, 2024
@JaroslavTulach JaroslavTulach self-assigned this Feb 1, 2024
@JaroslavTulach JaroslavTulach marked this pull request as draft February 1, 2024 12:38
@JaroslavTulach (Member Author):

The .bindings files get larger, obviously. The original sizes were:

enso$ find built-distribution/ -type f | grep bindings$ | xargs wc -c
  30416 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Visualization/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Visualization.bindings
   1341 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Google_Api/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Google_Api.bindings
  52768 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/AWS/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/AWS.bindings
   9655 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Searcher/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Searcher.bindings
   4310 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Geo/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Geo.bindings
  19415 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Image/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Image.bindings
   8678 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Examples/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Examples.bindings
  29898 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Test_New/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Test_New.bindings
 926343 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Base/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Base.bindings
 449874 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Database/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Database.bindings
 377276 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Table/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Table.bindings
  23754 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Test/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Test.bindings
1933728 total

i.e. roughly 2 MB in total, but now the total is 66 MB:

enso$ find built-distribution/ -type f | grep bindings$ | xargs wc -c
   37876 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Searcher/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Searcher.bindings
   94060 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Geo/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Geo.bindings
  784829 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Image/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Image.bindings
  175137 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Examples/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Examples.bindings
 2208525 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Test_New/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Test_New.bindings
27094576 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Base/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Base.bindings
16163637 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Database/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Database.bindings
17308777 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Table/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Table.bindings
 2694845 built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Test/0.0.0-dev/.enso/cache/bindings/0.0.0-dev/Standard/Test.bindings
66562262 total

@JaroslavTulach JaroslavTulach force-pushed the wip/jtulach/CachePerLibrary_8639 branch from 3f41ecd to e570e37 Compare February 2, 2024 10:22
@JaroslavTulach JaroslavTulach linked an issue Feb 3, 2024 that may be closed by this pull request
@JaroslavTulach (Member Author) commented Feb 3, 2024

bb1272d should fix #8698, as verified by another benchmark run:

[benchmark chart: empty startup]

Down by 3.5% to 2600 ms in empty startup (without using any Standard libraries). The startup with Standard libraries takes 1 s longer, i.e. 1 s is the cost of processing those libraries.

@JaroslavTulach JaroslavTulach marked this pull request as ready for review February 3, 2024 06:49
collection.concurrent.TrieMap[QualifiedName, Future[Boolean]]()

/** The thread pool that handles serialization. */
private val pool: ThreadPoolExecutor = new ThreadPoolExecutor(
@JaroslavTulach (Member Author):

This PR moves all the thread handling into SerializationPool, to separate the work with threads from the remaining logic inside SerializationManager.

@JaroslavTulach JaroslavTulach added CI: Clean build required CI runners will be cleaned before and after this PR is built. and removed CI: Keep up to date Automatically update this PR to the latest develop. labels Feb 5, 2024
import java.util.logging.Level;
import org.enso.pkg.QualifiedName;

final class SerializationPool {
@JaroslavTulach (Member Author):

SerializationPool contains all the threading logic of the original SerializationManager. It replaces Thread.sleep with Object.wait and trades the parallel executor for a newSingleThreadExecutor. Otherwise the design with isSerializing and isWaitingForSerialization remains unchanged.
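A minimal sketch of that wait/notify coordination, with all names invented for illustration (only the overall pattern reflects the PR):

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: coordinate through a shared monitor instead of
// polling with Thread.sleep(100).
final class PoolSketch {
  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  private final Set<String> inFlight = new HashSet<>();

  void submit(String key, Runnable task) {
    synchronized (inFlight) {
      inFlight.add(key);
    }
    executor.submit(() -> {
      try {
        task.run();
      } finally {
        synchronized (inFlight) {
          inFlight.remove(key);
          inFlight.notifyAll(); // wake everyone waiting for this key
        }
      }
    });
  }

  void waitWhileSerializing(String key) throws InterruptedException {
    synchronized (inFlight) {
      while (inFlight.contains(key)) {
        inFlight.wait(); // instead of sleeping and re-checking
      }
    }
  }
}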

shutdownPools();
}
synchronized (threads) {
for (var t : threads.keySet()) {
@JaroslavTulach (Member Author) commented Feb 5, 2024:

Looks like this fixes the problem where context.close() complains about dangling threads. Probably shutdownPool() stops the pools, but the threads are left to finish later; sometimes context.close() is called faster and some of the threads have not yet finished.

Originally, with Thread.sleep(100), this situation wasn't happening (too frequently), but with wait/notify it got way more common. This t.join() seems to fix the problematic situation altogether.
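A condensed sketch of that shutdown sequence; the field and method names are mine, only the t.join() call mirrors the change:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;

final class ShutdownSketch {
  // records each worker Thread created by the executor's ThreadFactory,
  // mirroring the 'threads' map in the snippet above
  final Map<Thread, Boolean> threads = new ConcurrentHashMap<>();

  void shutdown(ExecutorService executor) throws InterruptedException {
    executor.shutdown(); // stop accepting new tasks
    synchronized (threads) {
      for (var t : threads.keySet()) {
        // wait until each worker really terminates; otherwise a fast
        // context.close() may still observe it as a dangling thread
        t.join();
      }
    }
  }
}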


var meta = new ModuleCache.Metadata("hash", "code", CompilationStage.AFTER_CODEGEN.toString());
var cachedIr = module.getCache().deserialize(ensoCtx, arr, meta, null);
var cachedIr = mc.deserialize(ensoCtx, ByteBuffer.wrap(arr), meta, null);
@JaroslavTulach (Member Author):

Deserialization now works with ByteBuffer, not byte[], to allow mmapped files.
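For illustration, a sketch of how an mmapped ByteBuffer can be obtained; the size threshold is made up, but it mirrors the checklist item that restricts mapping to "large files":

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MmapSketch {
  static ByteBuffer open(Path cacheFile) throws IOException {
    try (var channel = FileChannel.open(cacheFile, StandardOpenOption.READ)) {
      if (channel.size() > 1 << 20) { // a "large file" such as Base.bindings
        // the mapping stays valid after the channel is closed
        return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
      }
    }
    // small files: a plain read plus wrap is cheaper than mapping
    return ByteBuffer.wrap(Files.readAllBytes(cacheFile));
  }
}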

* @return the read object
* @throws java.io.IOException when an I/O problem happens
*/
public static <T> Reference<T> read(ByteBuffer buf, Function<Object, Object> readResolve)
@JaroslavTulach (Member Author) commented Feb 5, 2024:

New method to allow us to pass in an mmapped buffer. The internals were already prepared for working with ByteBuffer.

public final class ModuleCache extends Cache<ModuleCache.CachedModule, ModuleCache.Metadata> {

public final class ModuleCache
implements Cache.Spi<ModuleCache.CachedModule, ModuleCache.Metadata> {
@JaroslavTulach (Member Author):

Cache is no longer subclassed. All the cache implementations now implement the Spi interface and create an instance of Cache with its create method.
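A condensed sketch of that pattern; the methods inside Spi are invented, only the create factory shape follows the description above:

final class Cache<T, M> {
  interface Spi<T, M> {
    byte[] serialize(T entry);              // illustrative method
    T deserialize(byte[] data, M metadata); // illustrative method
  }

  private final Spi<T, M> spi;

  private Cache(Spi<T, M> spi) {
    this.spi = spi;
  }

  static <T, M> Cache<T, M> create(Spi<T, M> spi) {
    return new Cache<>(spi); // the only way to obtain a Cache
  }
}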


import com.oracle.truffle.api.TruffleFile;

final class CacheUtils {
@JaroslavTulach (Member Author):

Methods of Cache that are really helper utilities, moved into a class of their own.

@JaroslavTulach (Member Author):

I believe the code is ready to be reviewed. There seems to be a suspicious StackOverflowError in the CI logs/runs. I am not sure how frequent or dangerous it is; I haven't seen the StackOverflowError locally yet.

// Bound the waiting loop
int maxCount = 60;
int counter = 0;
while (this.hasJobsRemaining() && counter < maxCount) {
Collaborator:

log when max count was reached?

@JaroslavTulach (Member Author) commented Feb 6, 2024:

This is a relic of the original code. I kept it, but I have no idea what it was trying to achieve. In my opinion one should either:

  • abort the serialization - in that case there is the executor.shutdownNow() method
  • wait for all serializations to finish - in that case there is executor.shutdown()

One calls either the first or the second method. I have no idea why the shutdown sequence is so complicated. Both modes are sketched below.
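For reference, a sketch of those two standard ExecutorService shutdown modes; the bounded counting loop in the original code approximates the second one:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

final class ShutdownModes {
  // abort: drop queued tasks and interrupt the running ones
  static void abort(ExecutorService executor) {
    executor.shutdownNow();
  }

  // drain: accept no new tasks, let queued ones finish, wait boundedly
  static void drain(ExecutorService executor) throws InterruptedException {
    executor.shutdown();
    executor.awaitTermination(60, TimeUnit.SECONDS);
  }
}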

pool.abort(toQualifiedName(libraryName));
return scala.Option.empty();
} else {
pool.waitWhileSerializing(toQualifiedName(libraryName));
Collaborator:

Is this safe? Nothing stops the context from serializing the library between waiting and loading the cache. Not that it is likely to happen, but still.

@JaroslavTulach (Member Author):

It is as "safe" as the original implementation - I haven't changed the logic. True, the original code was probably not really multithreading-ready; the fact that it was using Thread.sleep for coordinating threads speaks for itself.

I haven't attempted to make the code bulletproof - that would require a bigger rewrite. By the way, can the code even be really safe when multiple processes access the caches at the same time? That might require some coordination through the filesystem (see the sketch below). What goals do we have for the caching infrastructure?

My current implementation implies:

  • caches are written in a background executor thread
  • reading the caches happens in the caller thread and waits until the background thread finishes

If we want more, we should probably sit down and collect the remaining requirements/constraints, then rewrite the implementation.
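On the multi-process question: one possible building block - an assumption of mine, not something this PR attempts - is OS-level advisory file locking:

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: two engine processes would contend on the same
// OS-level lock before touching the same cache file.
final class CrossProcessLockSketch {
  static void withExclusiveLock(Path cacheFile, Runnable action) throws IOException {
    try (FileChannel ch =
            FileChannel.open(cacheFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileLock lock = ch.lock()) { // blocks until the lock is granted
      action.run();
    }
  }
}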

@JaroslavTulach (Member Author):

It would be great if the coordination of filesystem access to caches were encapsulated in a single class like SerializationPool. To replace this waitWhileSerializing method we would need something like:

public <T> T readCaches(QualifiedName name, Callable<T> action) {...}

that would prevent serialization of the name key while executing the action. I can try to redesign that.
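A sketch of how that could look (QualifiedName is replaced with String here, and the class and lock strategy are assumptions; the background serializer would have to take the same per-key lock before writing):

import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

final class ReadCachesSketch {
  private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

  public <T> T readCaches(String name, Callable<T> action) throws Exception {
    var lock = locks.computeIfAbsent(name, k -> new ReentrantLock());
    lock.lock();
    try {
      return action.call(); // serialization of 'name' is blocked meanwhile
    } finally {
      lock.unlock();
    }
  }
}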

@JaroslavTulach (Member Author) commented Feb 6, 2024:

There seems to be some suspicious StackOverflowError in the CI logs/runs...

The way to reproduce the exception is to run sbt buildEngineDistribution and then:

enso$ ./built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/bin/enso --ir-caches --no-compile-dependencies --no-global-cache --compile ./built-distribution/enso-engine-0.0.0-dev-linux-amd64/enso-0.0.0-dev/lib/Standard/Google_Api/0.0.0-dev/
[ERROR] [2024-02-06T06:43:12+01:00] [org.enso.runner.Main] Unexpected internal error
org.graalvm.polyglot.PolyglotException: Resource exhausted: Stack overflow
~/NetBeansProjects/enso/enso$  

Fixed by 947824e - when deserialization yields an IOException, ensureParsed runs again, but without using caches.
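A self-contained sketch of the idea behind that fix, with every name invented for illustration:

// When the cache cannot be read, degrade to a plain parse that does not
// consult caches, instead of retrying the cache path (which is what
// overflowed the stack).
final class FallbackSketch {
  record Module(String name) {}

  Module ensureParsed(String name, boolean useCaches) {
    if (useCaches) {
      try {
        return deserializeFromCache(name); // fast path
      } catch (java.io.IOException e) {
        return ensureParsed(name, false); // retry once, without caches
      }
    }
    return new Module(name); // plain parse, no cache involved
  }

  private Module deserializeFromCache(String name) throws java.io.IOException {
    throw new java.io.IOException("corrupted cache"); // simulate a failure
  }
}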

@JaroslavTulach (Member Author) commented Feb 7, 2024

There is an error during serialization of suggestions; however, the exception seems to be swallowed in spite of being logged via TruffleLogger (cc'ing @hubertp). Trying to reveal the stack trace with the good old printStackTrace: 76f47a7

Update: With the help of printStackTrace (sic!) the exception got printed. Let's see if 11776a6 fixes the problem.

@Akirathan (Member) left a comment:

The changes seem generally good and understandable. I have left just a few very minor nitpicks.

@@ -179,7 +179,7 @@ class ImportResolver(compiler: Compiler) {
u.compilationStage(CompilationStage.INITIAL)
}
)
compiler.ensureParsed(mod)
compiler.ensureParsed(mod, false)
@JaroslavTulach (Member Author):

Don't use caches to avoid this NPE.

@JaroslavTulach (Member Author):

Benchmarks continue to look fine:

[benchmark chart: 100 ms speedup]

There continues to be a 100 ms speedup.

@JaroslavTulach JaroslavTulach merged commit 9a91b7b into develop Feb 9, 2024
28 checks passed
@JaroslavTulach JaroslavTulach deleted the wip/jtulach/CachePerLibrary_8639 branch February 9, 2024 03:51
@JaroslavTulach (Member Author) commented Feb 12, 2024

First CI failures are starting to appear after integration of this change: https://github.com/enso-org/enso/actions/runs/7869109006/job/21467537020?pr=9024#step:8:974

.FileSystemException: built-distribution\enso-engine-2024.1.1-dev-windows-amd64\enso-2024.1.1-dev\lib\Standard\Base\2024.1.1-dev\.enso\cache\bindings\2024.1.1-dev\Standard\Base.bindings: The process cannot access the file because it is being used by another process

I wish I knew what the other process is.

Labels
CI: Clean build required - CI runners will be cleaned before and after this PR is built.
CI: No changelog needed - Do not require a changelog entry for this PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues:

  • Avoid sleep(100) when shutting down (and measuring startup)
  • Use single cache file per library
3 participants