evaluator: parsing antler CUE configs can exhaust system memory #3452

Open
heistp opened this issue Sep 13, 2024 · 7 comments

@heistp

heistp commented Sep 13, 2024

What version of CUE are you using (cue version)?

$ cue version
cue version v0.5.0

go version go1.23.0
      -buildmode exe
       -compiler gc
  DefaultGODEBUG asynctimerchan=1,gotypesalias=0,httplaxcontentlength=1,httpmuxgo121=1,httpservecontentkeepheaders=1,netedns0=0,panicnil=1,tls10server=1,tls3des=1,tlskyber=0,tlsrsakex=1,tlsunsafeekm=1,winreadlinkvolume=0,winsymlink=0,x509keypairleaf=0,x509negativeserial=1
     CGO_ENABLED 1
          GOARCH amd64
            GOOS linux
         GOAMD64 v1

Does this issue reproduce with the latest stable release?

Yes. It's the same or possibly worse in v0.10.0, with or without CUE_EXPERIMENT=evalv3.

What did you do?

I created a CUE package for Antler in this sce-tests repo. This is an Antler test config with 216 tests that uses both large lists (generated programmatically with Go templates) and CUE list comprehensions, which likely results in a large CUE graph.

The biggest culprit, it seems, is the Run list for my FCT tests. This creates a list of 1200 elements (using a Go template), which is used with a list comprehension to generate StreamClients. When Antler unifies the schema with the config using the CUE API, the process memory reported by top rises very quickly and can completely exhaust system memory, depending on the hardware. If I comment out this list, it's still slow and uses a lot of memory compared to what I'd hope for, but it's at least much faster.
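For readers who don't want to open the repo, here is a minimal sketch of the general pattern, not the actual sce-tests config. The names `#Flow`, `_flows`, `clients`, `Wait`, and `Length` are hypothetical and chosen only for illustration; the real list has about 1200 elements and is emitted by a Go template.

```cue
// Hypothetical schema for one generated flow (not Antler's real schema).
#Flow: {
	wait:   number & >=0
	length: int & >0
}

// In the real config this list is generated by a Go template
// and has roughly 1200 entries; two are shown here.
_flows: [...#Flow] & [
	{wait: 0.5, length: 1000},
	{wait: 1.2, length: 25000},
]

// List comprehension: one client struct per generated flow.
clients: [for f in _flows {
	StreamClient: {
		Wait:   f.wait
		Length: f.length
	}
}]
```

Every element produced by the comprehension is unified against the schema, so the evaluation work grows with the list length and with whatever the schema does per element.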

To reproduce it, one can install Antler, pull the sce-tests repo, and run antler vet to parse the config. My hope is that this isn't necessary for you to do, and just based on the description, you can identify the category of performance problem referred to in the Performance umbrella issue, so I have a sense of if or when this may be improved.

Also, I might be able to work around this by avoiding large lists, but it's flexible for users to provide their own statistical distribution of wait times and flow lengths, and these lists can simply get long. On top of that, this project will eventually at least triple in size with more tests, so I'll have to solve this somehow, and am just looking for advice. Would this be any better in v0.11.0-alpha.1, or with any other config options?

What did you expect to see?

The config to parse reasonably quickly.

What did you see instead?

Excessive memory allocations.

A Linux laptop with 8G of RAM and 8G of swap runs out of memory entirely when parsing the config.

Another box with 16G of RAM and 8G of swap is able to parse the config without running out of memory, but just barely.

@heistp heistp added NeedsInvestigation Triage Requires triage/attention labels Sep 13, 2024
@mvdan
Member

mvdan commented Sep 17, 2024

Is it possible to reproduce this slowness via the cue command on your sce-tests repo, for example via cue eval or cue export?

@heistp
Author

heistp commented Sep 18, 2024

Yes, although it takes a few steps:

  1. Pull the sce-tests repo.
  2. Add the file fct.cue from the attached fct.cue.gz, which comes from the antler vet command, but is attached here so you don't have to install antler to generate it.
  3. Edit sce.cue to uncomment the section under "polya fct tests", which I currently have commented out to prevent the memory problem.
  4. To get the config schema, copy the file config.cue into the same directory and change its package to "sce".
  5. Run cue export.

Running cue export this way operates entirely on static CUE files outside of antler, and it shows the same memory problem. Be prepared, though: a machine with 8G might become unusable and need a hard reboot. A machine with 16G should be able to handle it.

@heistp
Author

heistp commented Nov 21, 2024

I upgraded my box to 64 GB RAM and did some testing with CUE v0.5.0 and CUE v0.11.0. This is the resident memory reported after the config is completely parsed:

CUE v0.5.0: 9.6 GB
CUE v0.11.0: 22.6 GB
CUE v0.11.0 with CUE_EXPERIMENT=evalv3: 39.2 GB

I appear to be stuck on v0.5.0. Or, are there any other experimental flags I can try?

@mvdan
Member

mvdan commented Nov 21, 2024

That's it for now. We still have some performance and memory usage work to be done on evalv3, so that's still our focus for issues like this one.

@heistp
Author

heistp commented Dec 2, 2024

Just adding to this that I reduced the CPU and memory usage considerably by removing all the disjunctions I was using in my config schema.

CUE v0.5, with disjunctions:

54.40user 2.69system 0:29.84elapsed 191%CPU (0avgtext+0avgdata 10471328maxresident)k
0inputs+56outputs (0major+2691891minor)pagefaults 0swaps
/bin/time antler vet  54.40s user 2.70s system 191% cpu 29.849 total

CUE v0.5, without disjunctions:

6.76user 0.41system 0:04.37elapsed 164%CPU (0avgtext+0avgdata 1487272maxresident)k
0inputs+56outputs (0major+386054minor)pagefaults 0swaps
/bin/time /tmp/antler vet  6.77s user 0.42s system 164% cpu 4.374 total

CUE v0.11, without disjunctions, without CUE_EXPERIMENT=evalv3:

27.54user 1.20system 0:16.08elapsed 178%CPU (0avgtext+0avgdata 4745528maxresident)k
0inputs+56outputs (0major+1255867minor)pagefaults 0swaps
/bin/time /tmp/antler vet  27.55s user 1.21s system 178% cpu 16.090 total

CUE v0.11, without disjunctions, with CUE_EXPERIMENT=evalv3 (however this got a "field not allowed" error on a line number that doesn't make sense yet, so I'll try to sort this out later):

12.57user 1.32system 0:05.37elapsed 258%CPU (0avgtext+0avgdata 5138912maxresident)k
0inputs+56outputs (0major+1305055minor)pagefaults 0swaps
CUE_EXPERIMENT=evalv3 /bin/time /tmp/antler vet  12.58s user 1.33s system 258% cpu 5.375 total

At least this shows that in my case, the way I'm using disjunctions (to enforce that only one field is set in a struct, and those structs are themselves used inside of recursive structs) is a pretty big portion of the resource consumption. I can make a workaround for this, but also look forward to things returning to a v0.5 level of performance, or better one day. 🤞
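To make the shape of the problem concrete, here is a minimal sketch of a "only one field set" disjunction inside a recursive struct, followed by a disjunction-free variant. This is an assumption about the general pattern, not Antler's actual schema: `#Test`, `#Run`, `#RunNoDisj`, and all field names are hypothetical.

```cue
// Hypothetical leaf type.
#Test: {
	ID: string
}

// With disjunctions: each #Run is exactly one of the three forms,
// and two of the forms refer back to #Run recursively.
#Run: {Serial: [...#Run]} | {Parallel: [...#Run]} | {Test: #Test}

// Without disjunctions: every field is optional, and the
// "exactly one is set" rule is no longer enforced by CUE.
// Checking it elsewhere (for example in Go after decoding)
// is an assumption of this sketch.
#RunNoDisj: {
	Serial?: [...#RunNoDisj]
	Parallel?: [...#RunNoDisj]
	Test?: #Test
}

// Example value that satisfies either schema.
example: #Run & {
	Serial: [
		{Test: ID: "a"},
		{Parallel: [{Test: ID: "b"}, {Test: ID: "c"}]},
	]
}
```

In the first form the evaluator has to consider the alternatives at every level of nesting, which is one plausible reason the cost grows so quickly with large, deeply nested configs.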

@mvdan
Member

mvdan commented Dec 2, 2024

Don't spend energy reducing that one "field not allowed" error, as we are aware of regressions in that space: #3601

We are going to continue with the performance work once these known regressions are fixed. Follow #2850 for updates :)

@heistp
Author

heistp commented Dec 2, 2024

> Don't spend energy reducing that one "field not allowed" error, as we are aware of regressions in that space: #3601
>
> We are going to continue with the performance work once these known regressions are fixed. Follow #2850 for updates :)

Ok, that's a time saver, thanks.

@myitcv myitcv changed the title parsing antler CUE configs can exhaust system memory evaluator: parsing antler CUE configs can exhaust system memory Dec 6, 2024
@myitcv myitcv added evaluator and removed Triage Requires triage/attention labels Dec 6, 2024