
CI failures for ubuntu on 1.10 and nightly, probably out-of-memory issues #2441

Closed
lgoettgens opened this issue Jun 1, 2023 · 33 comments
Labels
bug Something isn't working

Comments

@lgoettgens
Member

Describe the bug
In the past few days, I noticed the test-ubuntu-1.8 job terminating unexpectedly midway. A restart of the test always succeeded.

See e.g. https://github.com/oscar-system/Oscar.jl/actions/runs/5145822885/jobs/9263999579 and https://github.com/oscar-system/Oscar.jl/actions/runs/5133045620/jobs/9235131573

@lgoettgens lgoettgens added the bug Something isn't working label Jun 1, 2023
@thofma
Collaborator

thofma commented Jun 1, 2023

After "wall of chambers" comes

Test Summary:                  | Pass  Total
K3 surface automorphism groups |   15     15
-> Testing ./AlgebraicGeometry/Surfaces/K3Auto.jl took: 154.803 seconds, 31089.35 MB

(copied from a different run), so I am not too surprised that this might get killed. Same for the second log you posted, where the next test would be

Test Summary:                                                       | Pass  Total   Time
Line bundle cohomologies and vanishing sets (set_attributes = true) |   30     30  27.9s
Test Summary:                                                        | Pass  Total   Time
Line bundle cohomologies and vanishing sets (set_attributes = false) |   30     30  24.3s
-> Testing AlgebraicGeometry/ToricVarieties/line_bundle_cohomologies.jl took: runtime 50.026 seconds + compilation 7.299 seconds + recompilation 0.000 seconds, 12774.31 MB

@benlorenz
Member

See #2570 for some more examples of this; it happens not only with Julia 1.8 but also with 1.6 and 1.10.

So far we have seen this happen:

Feel free to add more examples.

@benlorenz benlorenz changed the title Seemingly random CI failures for ubuntu-1.8 Seemingly random CI failures for ubuntu, probably out-of-memory issues Jul 19, 2023
@simonbrandhorst
Collaborator

Is it possible to get more information? Like did the process OOM or did it take too long?
There is some randomization involved in K3Auto.jl.
I am wondering where it fails exactly. Maybe there is a bug somewhere.
Can this be reproduced using some known random seed?

@simonbrandhorst
Collaborator

I can try to make the test run faster. But I think that more than 50% of its time is already compilation.
So how to proceed?

@benlorenz
Member

benlorenz commented Jul 19, 2023

I think the process was killed due to an OOM situation. According to the timestamps in the log (can be turned on via the settings knob in the top right of that page) this was about 1.5 minutes after the walls of chamber testset was complete. (The total duration of that job is about 1 hour which is far away from any time limit).
Unfortunately we can't get any more information from this failure.

Maybe you can adapt the code to use a custom seeded rng, e.g. Oscar.get_seeded_rng(), then we might be able to reproduce the error if it happens again (the seed for this is fixed in the CI but one could easily run the test in a loop with different seeds to find bad examples).

For the default julia rng knowing the seed would probably not really help as there might be an unknown + non-deterministic number of other uses which change the rng.
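The seed-scanning idea could be sketched roughly as follows. This is a hypothetical illustration only: `run_k3auto_testset` is a stand-in for the actual test body, and the real wiring through `Oscar.get_seeded_rng()` may look different.

```julia
# Hypothetical sketch: scan seeds to hunt for a failing run.
# `run_k3auto_testset` is a placeholder for the actual K3Auto.jl tests,
# which would need to draw all their randomness from the passed rng.
using Random

function run_k3auto_testset(rng::AbstractRNG)
    sum(rand(rng, 1:10, 100))  # placeholder computation
end

for seed in 1:50
    rng = MersenneTwister(seed)
    try
        run_k3auto_testset(rng)
    catch err
        @warn "testset failed for seed $seed" err
    end
end
```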

@HereAround
Member

I cannot say much towards the detailed cause or a possible resolution. However, I can share a few insights into AlgebraicGeometry/ToricVarieties/line_bundle_cohomologies.jl, which seems to be the place where this segfault is triggered quite often. (In fact, I experienced this myself the other day and considered it a fluke. At that time, however, I was not yet aware of this issue.)

  • To the best of my knowledge, the tests in AlgebraicGeometry/ToricVarieties/line_bundle_cohomologies.jl have not been changed in the last couple of months. I would thus think that either this has always been a problem and we did not notice it, or something else changed more recently and is now causing this.
  • On the julia side, the computations are rather elementary. Their complexity is not more (but also not less) than many of the other tests in toric geometry that we execute before the line bundle cohomologies.
  • What sets these tests apart, though, is that we use the external software cohomCalg for a few tasks. This is C++ software (cf. https://github.com/BenjaminJurke/cohomCalg) which @fingolfin wrapped for me about 2 years (?) back. Some of the combinatorial tasks executed by this software can become fairly challenging. However, the examples chosen in this test file are sufficiently elementary that these computations should be executable on essentially any computer.

As said, I am not sure about the cause. But maybe this information helps @benlorenz to guess/track down the cause for this?

@benlorenz
Member

benlorenz commented Jul 20, 2023

The original report in this ticket was from the first of June, the last change in the line_bundle_cohomologies test file was just a week earlier: #2396

Please note that this is not a segfault but the process seems to run out of memory.
Since julia reports about 12GB of allocations for that test-file the memory-intensive code seems to use julia memory allocations, i.e. it is probably not cohomCalg.
A good start might be to check which of the test-groups in that file need a lot of allocations, e.g. using @time?
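A minimal sketch of that suggestion (the testset name and body below are placeholders, not the actual contents of the file):

```julia
# Wrap each test group in @time to see which one dominates runtime and
# allocations; @time prints elapsed time, allocation count, bytes allocated,
# and GC time, e.g. `42.27 seconds (277.45 M allocations: 12.313 GiB, ...)`.
using Test

@time @testset "placeholder test group" begin
    @test sum(rand(10^6)) > 0   # stand-in for the real tests
end
```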

@benlorenz
Member

benlorenz commented Jul 20, 2023

Another new case:

-> Testing AlgebraicGeometry/Schemes/CoherentSheaves.jl took: 78.817 seconds, 5119.89 MB

signal (15): Terminated
in expression starting at none:1
epoll_wait at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)

from the OscarCI tests in GAP: https://github.com/oscar-system/GAP.jl/actions/runs/5608123338/jobs/10260113884#step:7:1471

The next testgroup after CoherentSheaves.jl would be elliptic_surface.jl which also needs quite a lot of memory allocations:

-> Testing AlgebraicGeometry/Schemes/CoherentSheaves.jl took: runtime 11.783 seconds + compilation 74.945 seconds + recompilation 0.000 seconds, 3271.67 MB
Test Summary:     | Pass  Total     Time
elliptic surfaces |    2      2  5m10.8s
-> Testing AlgebraicGeometry/Schemes/elliptic_surface.jl took: runtime 151.211 seconds + compilation 159.639 seconds + recompilation 0.000 seconds, 15826.34 MB

HereAround added a commit to HereAround/Oscar.jl that referenced this issue Jul 20, 2023
@HereAround
Member

HereAround commented Jul 20, 2023

@benlorenz I stand corrected. Thank you for reminding me of this.

I could narrow this down to a single line. This line would compute all monomials of a certain degree in a certain graded ring (a.k.a. a homogeneous_component). I executed AlgebraicGeometry/ToricVarieties/line_bundle_cohomologies.jl with @time. Here is what I find:

  • With said homogeneous component computation: 42.267645 seconds (277.45 M allocations: 12.313 GiB, 37.02% gc time)
  • Without the homogeneous component computation: 3.380005 seconds (485.06 k allocations: 15.583 MiB)

So a fix could be to just remove this line from the tests. A corresponding PR is here: #2579.

Alternatively, we could of course dive deeper and see if the homogeneous component computation could be improved. If the latter is interesting, it can be reproduced as follows:

julia> ray_generators = [[1, 0, 0,-2,-3], [0, 1, 0,-2,-3], [0, 0, 1,-2,-3], [-1,-1,-1,-2,-3], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1], [0, 0, 0,-2,-3]];

julia> max_cones = [[1, 2, 3, 5, 6], [1, 2, 3, 5, 7], [1, 2, 3, 6, 7], [2, 3, 4, 5, 6], [2, 3, 4, 5, 7], [2, 3, 4, 6, 7], [1, 3, 4, 5, 6], [1, 3, 4, 5, 7], [1, 3, 4, 6, 7], [1, 2, 4, 5, 6], [1, 2, 4, 5, 7], [1, 2, 4, 6, 7]];

julia> weierstrass_over_p3 = normal_toric_variety(ray_generators, max_cones; non_redundant = true);

julia> R = cox_ring(weierstrass_over_p3)
Multivariate polynomial ring in 7 variables over QQ graded by 
  x1 -> [1 0]
  x2 -> [1 0]
  x3 -> [1 0]
  x4 -> [1 0]
  x5 -> [0 2]
  x6 -> [0 3]
  x7 -> [-4 1]

Of this ring R, I wish to compute all monomials of degree (0,4). This is a vector space of dimension 4551. Specifically, for this, I would have executed the following commands:

julia> l4 = anticanonical_bundle(weierstrass_over_p3);

julia> b = basis_of_global_sections(l4);

julia> length(b) == 4551
true

Let me cc @wdecker regarding the computation of homogeneous components, just in case.

@simonbrandhorst
Collaborator

The elliptic surface stuff is a really complex data structure and computation. We worked hard to get this test running in 2 minutes. I am afraid that the test cannot be simplified much further. Plus it takes more time for compilation than for computation right now. Not sure what can be done about the allocations though. There is probably room for improvement all over the place.

@benlorenz
Member

benlorenz commented Jul 20, 2023

Alternatively, we could of course dive deeper and see if the homogeneous component computation could be improved. If the latter is interesting, it can be reproduced as follows:

julia> ray_generators = [[1, 0, 0,-2,-3], [0, 1, 0,-2,-3], [0, 0, 1,-2,-3], [-1,-1,-1,-2,-3], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1], [0, 0, 0,-2,-3]];

julia> max_cones = [[1, 2, 3, 5, 6], [1, 2, 3, 5, 7], [1, 2, 3, 6, 7], [2, 3, 4, 5, 6], [2, 3, 4, 5, 7], [2, 3, 4, 6, 7], [1, 3, 4, 5, 6], [1, 3, 4, 5, 7], [1, 3, 4, 6, 7], [1, 2, 4, 5, 6], [1, 2, 4, 5, 7], [1, 2, 4, 6, 7]];

julia> weierstrass_over_p3 = normal_toric_variety(ray_generators, max_cones; non_redundant = true);

julia> R = cox_ring(weierstrass_over_p3)
Multivariate polynomial ring in 7 variables over QQ graded by 
  x1 -> [1 0]
  x2 -> [1 0]
  x3 -> [1 0]
  x4 -> [1 0]
  x5 -> [0 2]
  x6 -> [0 3]
  x7 -> [-4 1]

Of this ring R, I wish to compute all monomials of degree (0,4). This is a vector space of dimension 4551. Specifically, for this, I would have executed the following commands:

julia> l4 = anticanonical_bundle(weierstrass_over_p3);

julia> b = basis_of_global_sections(l4);

julia> length(b) == 4551
true
julia> b[1:5]
5-element Vector{MPolyDecRingElem{QQFieldElem, QQMPolyRingElem}}:
 x6^2
 x5^3
 x4^4*x5*x6*x7
 x4^8*x5^2*x7^2
 x4^12*x6*x7^3

This looks like degree (0,6) to me?

To me this feels like integer points in a polyhedron given by some non-negative constraints and two equations:

exps = solve_non_negative(ZZMatrix, matrix(ZZ, [1 1 1 1 0 0 -4; 0 0 0 0 2 3 1]), matrix(ZZ, 2,1, [0,4]));

julia> nrows(exps)
1170

julia> exps[1:5,:]
[0   0   0    0   2   0   0]
[0   0   0    4   0   1   1]
[0   0   0    8   1   0   2]
[0   0   0   16   0   0   4]
[0   0   1    3   0   1   1]

Note that here we have x5^2 (instead of x6^2), which fits (0,4).

If you really wanted (0,6) you could easily change this and get the original number back:

julia> @time exps = solve_non_negative(ZZMatrix, matrix(ZZ, [1 1 1 1 0 0 -4; 0 0 0 0 2 3 1]), matrix(ZZ, 2,1, [0,6]));
  0.182823 seconds (3.52 M allocations: 39.347 MiB, 12.41% gc time)

julia> nrows(exps)
4551
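Assuming `R` and `exps` from the snippets above, the exponent rows could presumably be mapped back to monomials along these lines (an untested sketch; indexing and conversion details may need adjusting):

```julia
# Untested sketch: rebuild the monomials of the Cox ring R from the
# exponent matrix returned by solve_non_negative; row i gives the
# exponent vector of one monomial in the 7 generators of R.
mons = [prod(gens(R)[j]^Int(exps[i, j]) for j in 1:ncols(exps))
        for i in 1:nrows(exps)]
```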

@lgoettgens
Member Author

Feel free to add more examples.

@simonbrandhorst
Collaborator

I think the process was killed due to an OOM situation. According to the timestamps in the log (can be turned on via the settings knob in the top right of that page) this was about 1.5 minutes after the walls of chamber testset was complete. (The total duration of that job is about 1 hour which is far away from any time limit). Unfortunately we can't get any more information from this failure.

Maybe you can adapt the code to use a custom seeded rng, e.g. Oscar.get_seeded_rng(), then we might be able to reproduce the error if it happens again (the seed for this is fixed in the CI but one could easily run the test in a loop with different seeds to find bad examples).

For the default julia rng knowing the seed would probably not really help as there might be an unknown + non-deterministic number of other uses which change the rng.

Yes, it could be an OOM. I did 200 test runs on a server, and each run took between 37 and 43.7 seconds. But why do we OOM? Does the bot have so little memory? Or do we leak and it piles up?

@ThomasBreuer
Member

Concerning

@testset "quo for FPGroup" in test/Groups/homomorphisms.jl

the tests that take most of the reported 5.8 seconds are @test_throws ErrorException quo(h, [x^10]) and @test_throws ErrorException quo(h, n). Their aim is to check whether GAP still runs into an error in this situation; once this is no longer the case, some Oscar code may be changed. I do not know a faster way to do this.

@benlorenz
Member

There is really something bad going on with 1.10 and nightly: previously this happened infrequently, but for the past few days (at most one week) it has been happening all the time.

I started a matrix job for 1.10 and nightly with GC logging and got this:

JIT:          62.053 MiB
Max. RSS:   6358.074 MiB
Free mem:    561.785 MiB
Free pmem:   561.785 MiB
Test Summary:        | Pass  Total   Time
Experimental.gmodule |   10     10  31.7s
GC: pause 761.52ms. collected 2320.698616MB. incr 
Heap stats: bytes_mapped 3840.94 MB, bytes_resident 2748.09 MB, heap_size 3517.94 MB, heap_target 540816867.94 MB, live_bytes 18379.10 MB
, Fragmentation 5.224Test Summary:        | Pass  Total   Time
Experimental LocalH2 |   75     75  40.3s

[2046] signal (15): Terminated
in expression starting at /home/runner/work/_actions/julia-actions/julia-runtest/latest/test_harness.jl:7
Error: The operation was canceled.

Log: https://github.com/oscar-system/Oscar.jl/actions/runs/5689868672/job/15422159399#step:6:3171

There are plenty more failed runs in this job.
Regarding the stats: I have no idea how live_bytes 18379.10 MB is possible since the runners have about 7GB of RAM, and the heap_target 552409759.83 MB is quite suspicious...

@benlorenz
Member

I noticed one further thing in the GC output. In my first run of these 16 jobs (e.g. this one), the GC live_bytes size jumps up for every one of them at the end of the GAP OscarInterface tests, like this:

Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1286.89 MB, heap_size 1832.69 MB, heap_target 2357.69 MB, live_bytes 1761.48 MB
, Fragmentation 0.961GC: pause 898.46ms. collected 30.505936MB. incr 
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1286.89 MB, heap_size 1832.31 MB, heap_target 2357.31 MB, live_bytes 1778.76 MB
, Fragmentation 0.971GC: pause 320.62ms. collected 552.570595MB. incr 
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1317.08 MB, heap_size 2221.08 MB, heap_target 869387521.08 MB, live_bytes 1748.25 MB
, Fragmentation 0.787   39156 ms (1847 ms GC) and 392MB allocated for alnuth/polynome.tst
-----------------------------------
total     86501 ms (5572 ms GC) and 416MB allocated
              0 failures in 4 files

#I  No errors detected while testing

GC: pause 148.70ms. collected 258.622101MB. incr 
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1307.08 MB, heap_size 2238.83 MB, heap_target 2851.33 MB, live_bytes 3159.52 MB
, Fragmentation 1.411GC: pause 211.27ms. collected 104.489433MB. full recollect
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1307.08 MB, heap_size 1900.67 MB, heap_target 2469.42 MB, live_bytes 9410.99 MB
, Fragmentation 4.951GC: pause 901.04ms. collected 7.217934MB. incr 
Heap stats: bytes_mapped 1728.42 MB, bytes_resident 1307.00 MB, heap_size 1900.50 MB, heap_target 2469.25 MB, live_bytes 9306.50 MB
, Fragmentation 4.897GC live:    9299.284 MiB
JIT:          19.680 MiB
Max. RSS:   3675.008 MiB
Free mem:   1472.082 MiB
Free pmem:  1472.082 MiB
GC live:    9299.346 MiB
JIT:          19.680 MiB
Max. RSS:   3675.008 MiB
Free mem:   1472.082 MiB
Free pmem:  1472.082 MiB

In some cases, like above, there is also the weird huge heap_target value, often just once.

But disabling these tests did not seem to help; the new job also failed for all 16 runs: https://github.com/oscar-system/Oscar.jl/actions/runs/5690757887

@gbaraldi

The GC uses some allocation heuristics and GC heuristics to decide the next target and they are usually pretty good, but it turns out sometimes something happens that makes them go crazy.

@fingolfin
Member

I've now submitted JuliaLang/julia#50705 to ensure upstream is aware and can add this to the 1.10 milestone

@fingolfin
Member

Supposedly JuliaLang/julia#50682 may help with this. We'll see!

@thofma
Collaborator

thofma commented Jul 29, 2023

I tried with JuliaLang/julia#50682 and Hecke (which has the same problem). We still have the same problem, but only later. Right before the OOM killer did its job, I saw:

Heap stats: bytes_mapped 11330.77 MB, bytes_resident 4593.22 MB,
heap_size 17854.11 MB, heap_target 23454.11 MB, live_bytes 92905.01 MB

This is on a machine with 32 GB of memory (of which 21 were available). Not sure about those numbers, but how can 92 GB of live bytes be possible?

@thofma thofma changed the title Seemingly random CI failures for ubuntu, probably out-of-memory issues CI failures for ubuntu on 1.10 and nightly, probably out-of-memory issues Jul 31, 2023
HereAround added a commit that referenced this issue Jul 31, 2023
@lgoettgens
Member Author

For the record, this happens on julia 1.6 as well. See https://github.com/oscar-system/Oscar.jl/actions/runs/5714074722/job/15480670050?pr=2609

@benlorenz
Member

For the record, this happens on julia 1.6 as well. See oscar-system/Oscar.jl/actions/runs/5714074722/job/15480670050?pr=2609

That test is disabled for now (#2579); it also happened on 1.8. But that was rare, in contrast to the constant failures on nightly and 1.10, which occur much earlier in our testsuite.

@thofma
Collaborator

thofma commented Aug 1, 2023

@benlorenz Can you reproduce it locally (w/o JuliaLang/julia#50682)?

@benlorenz
Member

I haven't really tested without that PR. I have run a bunch of tests with that PR merged, and it reliably crashes when running in a memory-constrained cgroup. A problem with that is that malloc behaves slightly differently in such a cgroup than when the memory is really full (it doesn't return null; instead the process is killed if the limit is exceeded).
This causes it to crash even when setting a very low --heap-size-hint value.

I have run various tests in the Github CI (7GB RAM on Ubuntu) to find a value for the heap size hint that works, maybe around 3.5GB (This is with plain nightly / 1.10.0-beta1).
Even with such a low value the memory usage can grow too much for the runners causing the process to be killed. See e.g here https://github.com/oscar-system/Oscar.jl/actions/runs/5723920046/attempts/2 and https://github.com/oscar-system/Oscar.jl/actions/runs/5718082420/attempts/1 (the extra number at the end of the job is the heap size hint in GB).

For comparison: julia 1.9.2 runs through the whole testsuite (about 80 minutes) with 8GB heap-size-hint in a 8.5GB cgroup, while julia nightly does not succeed in a 10.5GB cgroup with a hint of 8GB.
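For cheap memory monitoring of such runs, something along these lines could be interleaved between test groups (a sketch using only Base functions; the include path is illustrative):

```julia
# Log coarse memory numbers around a test group using Base's
# Sys.maxrss() and Sys.free_memory() (both return bytes).
log_mem(label) = println(label, ": maxrss = ",
    round(Sys.maxrss() / 2^20; digits = 1), " MiB, free = ",
    round(Sys.free_memory() / 2^20; digits = 1), " MiB")

log_mem("before CoherentSheaves.jl")
# include("AlgebraicGeometry/Schemes/CoherentSheaves.jl")  # illustrative path
log_mem("after CoherentSheaves.jl")
```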

I think there are several things contributing here:

  • There does not seem to be a default heap size hint on nightly and 1.10 (jl_gc_get_max_memory by default returns 2PB)
  • Compilation seems to use quite a bit more memory than on 1.9 and earlier, as we observed in "QuadFormAndIsom sometimes needing absurd time in CI" (#2599). That doctest example has 2.445 GiB for Sys.maxrss() on 1.9.2 after the test, but 4.424 GiB on nightly.
  • Miscounting of memory (live_bytes) which happens with both Hecke and Oscar. [1]
  • Maybe more?

[1]:
The miscounting can be triggered with some GAP tests (this uses gcext, but Hecke seems to trigger this without gcext as well):

julia> using Test

julia> using Oscar
 -----    -----    -----      -      -----   
|     |  |     |  |     |    | |    |     |  
|     |  |        |         |   |   |     |  
|     |   -----   |        |     |  |-----   
|     |        |  |        |-----|  |   |    
|     |  |     |  |     |  |     |  |    |   
 -----    -----    -----   -     -  -     -  

...combining (and extending) ANTIC, GAP, Polymake and Singular
Version 0.13.0-DEV ... 
 ... which comes with absolutely no warranty whatsoever
Type: '?Oscar' for more information
(c) 2019-2023 by The OSCAR Development Team

julia> GC.enable_logging(true)

julia> GC.gc(false); GC.gc(true)
GC: pause 187.85ms. collected 281.680916MB. incr 
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.88 MB, heap_size 490.38 MB, heap_target 752.88 MB, live_bytes 341.70 MB
, Fragmentation 0.697GC: pause 86.89ms. collected 0.282654MB. full recollect
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.75 MB, heap_size 490.25 MB, heap_target 752.75 MB, live_bytes 382.11 MB
, Fragmentation 0.779GC: pause 386.24ms. collected 1.028664MB. incr 
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.62 MB, heap_size 489.22 MB, heap_target 751.72 MB, live_bytes 381.95 MB
, Fragmentation 0.781
julia> @test GAP.Globals.TestDirectory(GAP.Globals.DirectoriesPackageLibrary(GAP.Obj("OscarInterface"), GAP.Obj("tst")))
Architecture: x86_64-pc-linux-gnu-julia1.11-64-kv8

testing: /home/lorenz/software/polymake/julia/Oscar.jl/src/../gap/OscarInterface//tst/alnuth/ALNUTH.tst
# line 1 of 86 (1%)GC: pause 85.86ms. collected 0.753052MB. full recollect
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.62 MB, heap_size 489.23 MB, heap_target 751.73 MB, live_bytes 380.92 MB
, Fragmentation 0.779GC: pause 368.91ms. collected 0.018028MB. incr 
Heap stats: bytes_mapped 448.11 MB, bytes_resident 335.61 MB, heap_size 489.21 MB, heap_target 751.71 MB, live_bytes 380.97 MB
# line 18 of 86 (20%)GC: pause 99.35ms. collected 333.437424MB. incr 
Heap stats: bytes_mapped 512.12 MB, bytes_resident 500.62 MB, heap_size 676.11 MB, heap_target 982.36 MB, live_bytes 380.95 MB
, Fragmentation 0.563GC: pause 126.04ms. collected 522.456375MB. incr 
Heap stats: bytes_mapped 704.17 MB, bytes_resident 670.53 MB, heap_size 875.80 MB, heap_target 1313.30 MB, live_bytes 413.54 MB
, Fragmentation 0.472GC: pause 216.44ms. collected 754.326622MB. incr 
Heap stats: bytes_mapped 1024.25 MB, bytes_resident 964.05 MB, heap_size 1164.24 MB, heap_target 1689.24 MB, live_bytes 454.10 MB
# line 81 of 86 (94%)GC: pause 230.63ms. collected 1098.841599MB. incr 
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 1126.45 MB, heap_size 1333.83 MB, heap_target 1683.83 MB, live_bytes 551.80 MB
   44871 ms (1092 ms GC) and 11.2MB allocated for alnuth/ALNUTH.tst
testing: /home/lorenz/software/polymake/julia/Oscar.jl/src/../gap/OscarInterface//tst/alnuth/examples.tst
# line 1 of 271 (0%)GC: pause 146.92ms. collected 81.539536MB. full recollect
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 1102.11 MB, heap_size 1307.03 MB, heap_target 1744.53 MB, live_bytes 585.39 MB
, Fragmentation 0.448GC: pause 390.70ms. collected 155.567131MB. incr 
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 803.80 MB, heap_size 994.14 MB, heap_target 1387.89 MB, live_bytes 574.64 MB
    6042 ms (561 ms GC) and 15.3MB allocated for alnuth/examples.tst
testing: /home/lorenz/software/polymake/julia/Oscar.jl/src/../gap/OscarInterface//tst/alnuth/manual.tst
# line 1 of 64 (1%)GC: pause 160.84ms. collected 188.842827MB. full recollect
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 787.28 MB, heap_size 1011.43 MB, heap_target 1405.18 MB, live_bytes 419.07 MB
, Fragmentation 0.414GC: pause 356.13ms. collected 0.587646MB. incr 
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 787.28 MB, heap_size 1010.73 MB, heap_target 1404.48 MB, live_bytes 479.72 MB
     818 ms (592 ms GC) and 3.63MB allocated for alnuth/manual.tst
testing: /home/lorenz/software/polymake/julia/Oscar.jl/src/../gap/OscarInterface//tst/alnuth/polynome.tst
# line 2 of 47 (4%)GC: pause 122.30ms. collected 23.319710MB. full recollect
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 787.28 MB, heap_size 1008.29 MB, heap_target 1402.04 MB, live_bytes 479.13 MB
, Fragmentation 0.475GC: pause 334.29ms. collected 32.628418MB. incr 
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 787.28 MB, heap_size 1002.85 MB, heap_target 1396.60 MB, live_bytes 499.14 MB
# line 16 of 47 (34%)GC: pause 193.51ms. collected 294.139137MB. incr 
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 791.02 MB, heap_size 1299.46 MB, heap_target 1736.96 MB, live_bytes 466.51 MB
   40842 ms (925 ms GC) and 392MB allocated for alnuth/polynome.tst
-----------------------------------
total     92573 ms (3170 ms GC) and 423MB allocated
              0 failures in 4 files

#I  No errors detected while testing

GC: pause 109.34ms. collected 542.099785MB. incr 
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 804.59 MB, heap_size 1525.13 MB, heap_target 1787.63 MB, live_bytes 1611.86 MB
, Fragmentation 1.057Test Passed

julia> GC.gc(false); GC.gc(true)
GC: pause 25.73ms. collected 29.697510MB. incr 
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 785.30 MB, heap_size 1092.44 MB, heap_target 1486.19 MB, live_bytes 8130.76 MB
, Fragmentation 7.443GC: pause 71.46ms. collected 76.772232MB. full recollect
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 785.30 MB, heap_size 1092.44 MB, heap_target 1486.19 MB, live_bytes 8120.73 MB
, Fragmentation 7.434GC: pause 282.94ms. collected 7.942421MB. incr 
Heap stats: bytes_mapped 1344.33 MB, bytes_resident 784.48 MB, heap_size 1091.55 MB, heap_target 1485.30 MB, live_bytes 8043.96 MB
, Fragmentation 7.369

Note that during the GC call after the tests were done the live_bytes size jumps to 8GB.

@gbaraldi

gbaraldi commented Aug 1, 2023

I can't reproduce that, which is really odd.

@benlorenz
Member

I can't reproduce that, which is really odd.

The live_bytes issue? Maybe try it like this:

$ julia-nightly
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.11.0-DEV.203 (2023-07-31)
 _/ |\__'_|_|_|\__'_|  |  Commit ec8df3da359 (1 day old master)
|__/                   |

(@v1.11) pkg> activate --temp
  Activating new project at `/tmp/jl_GThsJo`

(jl_GThsJo) pkg> add Oscar#master
<....>
julia> using Oscar; using Test;
<....>
julia> GC.enable_logging(true)

julia> @test GAP.Globals.TestDirectory(GAP.Globals.DirectoriesPackageLibrary(GAP.Obj("OscarInterface"), GAP.Obj("tst")))
<....>

Test Passed

julia> GC.gc(false); GC.gc(true)
GC: pause 75.32ms. collected 554.220963MB. incr
Heap stats: bytes_mapped 2176.53 MB, bytes_resident 886.66 MB, heap_size 1578.59 MB, heap_target 2103.59 MB, live_bytes 1634.71 MB
, Fragmentation 1.036GC: pause 69.74ms. collected 93.646736MB. full recollect
Heap stats: bytes_mapped 2176.53 MB, bytes_resident 875.67 MB, heap_size 1162.12 MB, heap_target 1599.62 MB, live_bytes 8104.16 MB
, Fragmentation 6.974GC: pause 221.34ms. collected 7.917084MB. incr
Heap stats: bytes_mapped 2176.53 MB, bytes_resident 874.69 MB, heap_size 1161.03 MB, heap_target 1598.53 MB, live_bytes 8010.67 MB
, Fragmentation 6.900

This happens reliably on two (very) different Linux machines with an official nightly binary (and also with a completely fresh julia depot).

I will try some debugging.

@gbaraldi

gbaraldi commented Aug 1, 2023

So it seems there are a couple of issues. I fixed a bug on the PR, but it still seems that our live_bytes tracking is no longer accurate (it doesn't actually affect any of the GC decisions on 1.10); live_bytes should never be greater than heap_size, because it is bounded by it. At the same time I also realized that heap-size-hint was broken; hopefully I fixed it in the last commit. It seems to get more and more wrong over time. We are potentially either double-counting allocations or missing frees in the accounting.
It also has some weird behavior where it potentially doesn't apply to child processes: JuliaLang/julia#50673

@lgoettgens
Member Author

Unfortunately, JuliaLang/julia#50682 being present in julia nightly didn't fix this issue. Any idea on how to proceed?

@thofma
Collaborator

thofma commented Aug 11, 2023

We should mention this over at JuliaLang/julia#50705

@gbaraldi

I'll play with it a bit more, see if I can figure something out.

@gbaraldi

gbaraldi commented Aug 11, 2023

Did you run on GitHub Actions or with a local cgroup? The testset does seem to be very close to needing 7GB, at least when running locally. And from looking at it, the GC does seem to be running, at least locally; it's just not able to free stuff.
@benlorenz could you run the tests with your cgroups again?

@fingolfin
Member

What's the status of this? Have the changes in Julia fixed this? Or did we end up disabling some tests (which?) to workaround?

In other words: does this need to stay open? If so, it would be good to have an explicit pointer to whatever disabled tests can be used to repro the issue.

@benlorenz
Member

There are no disabled tests.
I think several things improved this:

  • fixes in julia
  • split up testsuite
  • larger runners

I don't think this needs to stay open.
