Discussion: Performance testing #2434

Closed
Lakitna opened this issue Aug 28, 2020 · 44 comments
Labels
☠ stale Marked as stale by the stale bot, will be removed after a certain time.

Comments

@Lakitna
Contributor

Lakitna commented Aug 28, 2020

I've been talking with @nicojs about performance testing Stryker to help prevent performance regression. We've talked about this in various issues. Let's bundle the discussion here.

Current performance tests

The current performance benchmarks are executed on a high level. In practical terms, this comes in the form of letting Stryker loose on various projects. At the time of writing the /perf folder contains an Angular app, and I'm working on adding Express next to that #2417. As with any high-level test, this is not the fastest way of getting information. The baseline I executed for Express took about 8 minutes in Stryker@3.

You may have noticed I called them 'performance benchmarks' just now. The current setup does not do any warmup or repetition. This makes it a 'performance benchmark' rather than a proper 'performance test'. The reason for this comes back to the current tests being not the fastest.
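To make the difference between a one-shot benchmark and a test with warmup and repetition concrete, here is a minimal sketch. The helper names (`benchmark`, `summarize`) and the default run counts are assumptions for illustration; nothing like this exists in the /perf folder as far as this thread shows.

```typescript
// Hypothetical helpers: names and defaults are illustrative, not part of Stryker.
interface BenchmarkSummary {
  samples: number[]; // measured durations in ms, sorted ascending
  min: number;
  median: number;
  max: number;
}

// Reduce a list of durations to the range (min/max) and the median.
function summarize(durations: number[]): BenchmarkSummary {
  if (durations.length === 0) {
    throw new Error('Need at least one measurement');
  }
  const samples = durations.slice().sort((a, b) => a - b);
  const mid = Math.floor(samples.length / 2);
  const median =
    samples.length % 2 === 0 ? (samples[mid - 1] + samples[mid]) / 2 : samples[mid];
  return { samples, min: samples[0], median, max: samples[samples.length - 1] };
}

// Run the task a few times without recording (warmup), then record repeated runs.
function benchmark(task: () => void, warmupRuns = 2, measuredRuns = 5): BenchmarkSummary {
  for (let i = 0; i < warmupRuns; i++) {
    task();
  }
  const durations: number[] = [];
  for (let i = 0; i < measuredRuns; i++) {
    const start = Date.now();
    task();
    durations.push(Date.now() - start);
  }
  return summarize(durations);
}
```

Reporting the median together with min and max shows the range that a single run hides. With an 8 minute run per iteration this quickly gets expensive, which is the trade-off described above.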

Potential ways forward

Dropping down

It is very much possible to execute performance tests on a lower level. We could even make it low level enough that Stryker can test the performance tests for mutations. Now that's dogfooding! We probably shouldn't do this though.

The biggest advantages of this are:

  • Results should be more stable due to repetition. Performance always comes in a range. The difference between the slowest and fastest times can be significant. This becomes more apparent when a single iteration is very fast (think about < 100ms).
  • Results should vary less between computers with different hardware. Low-level stuff is more forgiving on your hardware.
  • The performance test suite in its totality should run quicker. Low-level stuff has less overhead.
  • Tests can be targeted more easily.
  • More data points can provide more insight and make tracing performance issues easier.

Biggest disadvantages:

  • May not reflect the actual performance that users are getting.
  • Requires tailor-made tests

This would bring us into the realm of tools like benchmark.js, nanobench, matcha, exectimer, etc. There are a lot of options here.

However, I do not know how well this would work in Stryker. I'm not familiar enough with the codebase. Are there semantically separate parts of Stryker that could be worth performance testing separately?

Stay up high

We could also stick with the high-level approach. We would basically keep simulating a user by copying in open source projects that give us a decent coverage of the type of projects Stryker can end up in.

The biggest advantages of this are:

  • Results better reflect the reality of users.
  • Requires comparatively little effort. We can use existing open-source projects and benefit from their efforts.

Biggest disadvantages:

  • Execution will be a lot slower. It might even become slow enough that we can't run all benchmarks in CI.
  • Fewer data points can make tracing issues more difficult.

This would probably bring us in the realm of CLI-focussed tools. I've used some before, but I don't think there are any written in JavaScript. This would add tech stack complexity. We could also write something ourselves using tools like execa in combination with one of the tools mentioned before. This should not be a lot of work.

Continuous performance testing

Most people are familiar with things like coverage gates & regression warnings. This is also possible with performance.

We could use something like this Github action: https://github.com/rhysd/github-action-benchmark. This can provide us with warnings like this:

On a practical note, this requires the performance tests in CI to not take very long. Currently, the slowest pipeline seems to be E2E-windows at about 23 minutes. I suggest ensuring that the performance tests do not take any longer than the E2E tests.

When comparing runs you have to make sure that both runs have been completed in the same circumstances. Luckily, pipeline agents tend to be consistently specced.
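As a rough sketch of what a performance gate boils down to: compare the current duration with a stored baseline and flag the run when the ratio exceeds a threshold. The function name, the result shape, and the default threshold of 1.5 are assumptions for illustration, not how any particular GitHub Action implements it.

```typescript
// Hypothetical result shape; field names are assumptions for illustration.
interface RegressionResult {
  ratio: number; // current duration divided by baseline duration
  regressed: boolean; // true when the ratio exceeds the allowed threshold
}

// Flag a run as a regression when it is more than `threshold` times slower than the baseline.
function checkRegression(baselineMs: number, currentMs: number, threshold = 1.5): RegressionResult {
  if (baselineMs <= 0) {
    throw new Error('Baseline duration must be positive');
  }
  const ratio = currentMs / baselineMs;
  return { ratio, regressed: ratio > threshold };
}
```

This only makes sense when both durations come from comparable machines, which is why consistently specced pipeline agents matter here.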

Use cases

Stryker has multiple use cases, defined by the technology behind the projects. For example, the Express benchmark covers the use case: CommonJS without instrumentation & the Mocha test runner.

Other use cases include:

  • Typescript + bundler & Jest
  • Angular & Karma
  • CommonJS + instrumenter & Jasmine
  • etc.

We should probably identify interesting use cases we want to test.

@nicojs
Member

nicojs commented Aug 28, 2020

Thanks a lot, @Lakitna! These are all valuable insights, I learned a lot from just reading this issue 😍

Dropping down

I haven't done a lot of thinking on "dropping down" with performance testing tools. I'm not sure how much value that would give us. For example, if the performance of inserting mutants in the code decreases by a factor of 2, this wouldn't have a big impact on running Stryker, because running tests takes up > 95% of the time spent on mutation testing. However, a performance degradation that is known is always better than one that is not known, and it might be an undesired side effect of a change.

Are there semantically separate parts of Stryker that could be worth performance testing separately?

I think the current plugin structure provides a nice, hard cut between separate parts. I can think of 2 places off the top of my head. (Note: we can still drop down deeper of course, but I'm not sure how useful that would be.)

  1. Each test runner plugin could be performance tested. We could create a generic test runner performance test harness that works with the test runner API (see packages/api/test_runner2.ts (will be renamed to test_runner shortly)). That way we could performance test the impact of hot reload (Implement hot reload for mocha test runner #2413) for example.
  2. The packages/instrumenter package is responsible for the mutation switching itself. It also has integration tests for the Instrumenter class itself (responsible for parsing, generating mutants, placing mutants, and printing the AST back to a file).

Stay up high

We could also write something ourselves using tools like execa in combination with one of the tools mentioned before.

We currently use execa together with console.time statements. See https://github.com/stryker-mutator/stryker/blob/7cfb8f1568530439d8bbf40c87b9ce1ab1fa7e96/perf/tasks/run-perf-tests.ts

It currently provides output like this:

$ ~/stryker-mutator/stryker/perf (fix/perf-test-express)
$ PERF_TEST_GLOB_PATTERN=express npm t
> ts-node tasks/run-perf-tests.ts
Running performance tests on express (matched with glob pattern "express")
Exec express npx stryker run
express: 1335.327ms last log message:  15:51:00 (6148) INFO ConfigReader Using stryker.conf.json
express: 68490.496ms last log message:  Mutation testing 13% (elapsed: ~1m, remaining: ~6m) 276/2017 tested (46 survived, 2 timed out)
express: 128546.600ms last log message:  Mutation testing 25% (elapsed: ~2m, remaining: ~5m) 523/2017 tested (101 survived, 4 timed out)
// ...
express: 488871.011ms last log message:  Mutation testing 84% (elapsed: ~8m, remaining: ~1m) 1703/2017 tested (274 survived, 16 timed out)
express: 548916.050ms last log message:  Mutation testing 97% (elapsed: ~9m, remaining: <1m) 1961/2017 tested (325 survived, 16 timed out)
express: 553141.450ms
all tests: 553141.741ms
Done

I would be happy to improve this output with a benchmark tool of some sort. I also would like to see the output in the form of the clear-text table for example (right now, all mutants might error and we wouldn't even know).
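One low-effort step towards knowing what happened in a run is to parse the progress lines that already show up in the output above. This is a sketch only: the regular expression is written against the exact log lines shown in this comment, and the `ProgressSnapshot` shape and function name are made up for illustration.

```typescript
// Assumed shape for one parsed progress line; the field names are illustrative.
interface ProgressSnapshot {
  percent: number;
  tested: number;
  total: number;
  survived: number;
  timedOut: number;
}

// Matches lines like:
// "Mutation testing 13% (elapsed: ~1m, remaining: ~6m) 276/2017 tested (46 survived, 2 timed out)"
const PROGRESS_PATTERN =
  /Mutation testing\s+(\d+)%.*?(\d+)\/(\d+) tested \((\d+) survived, (\d+) timed out\)/;

// Returns undefined for lines that are not progress lines.
function parseProgressLine(line: string): ProgressSnapshot | undefined {
  const match = PROGRESS_PATTERN.exec(line);
  if (match === null) {
    return undefined;
  }
  return {
    percent: Number(match[1]),
    tested: Number(match[2]),
    total: Number(match[3]),
    survived: Number(match[4]),
    timedOut: Number(match[5]),
  };
}
```

With the last snapshot of a run in hand, the perf script could at least compare the tested, survived, and timed out counts against expected values instead of only printing a duration.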

Continuous performance testing

Currently, I've configured the performance test workflow to be able to run on command with the new workflow_dispatch trigger. You can also choose which one you want to run. I just started it here: https://github.com/stryker-mutator/stryker/actions/runs/228789894

image

I think this is fine to start with. When a pull request changes something significant, we'll trigger the workflow before approving as part of the review. We can think of better ways to do this in the future.

Other use cases include:
...

I think this should be the list:

  • JavaScript+Mocha: express
  • Angular & Karma: current webshop example (angular version needs to be updated to be compatible with Stryker 4)
  • TypeScript+Mocha: Stryker itself? (performance test an older version of Stryker for example).
  • Javascript/Typescript & Jest
  • Jasmine-runner (nodejs)
  • A react project: Maybe we can use react-bootstrap ?

@Lakitna
Contributor Author

Lakitna commented Aug 31, 2020

Dropping down

  1. Each test runner plugin could be performance tested. We could create a generic test runner performance test harness that works with the test runner API (see packages/api/test_runner2.ts (will be renamed to test_runner shortly)). That way we could performance test the impact of hot reload (Implement hot reload for mocha test runner #2413) for example.

I really like doing this, because it will make the high-level perf tests a lot simpler. The impact on runtime can be traced back to 4 factors:

  1. Test runner (e.g. Mocha)
  2. Source code instrumenter (e.g. Typescript)
  3. Mutation count (roughly relates to lines of source code)
  4. Test count

Removing factors from consideration in high-level tests will speed up performance testing tremendously. It would also allow us to easily distinguish between runner issues and mutation issues.

  2. The packages/instrumenter package is responsible for the mutation switching itself. It also has integration tests for the Instrumenter class itself (responsible for parsing, generating mutants, placing mutants, and printing the AST back to a file).

I don't think we should do this one at a lower level. I think this one makes a lot more sense in a high-level setting. The reason for this is that high-level stuff tends to be more real-world when compared to lower-level stuff.

Also, we only have to do this once per test run (thanks to mutation switching). Therefore I would not care as much about the performance of this one. Pretty fast is probably good enough. And we can get a sense of its speed in high-level perf tests.

Stay up high

I would be happy to improve this output with a benchmark tool of some sort. I also would like to see the output in the form of the clear-text table for example (right now, all mutants might error and we wouldn't even know).

Yeah, I would suggest putting things in an existing benchmark tool so we can make use of their stability and reporting abilities. Other than that, I don't think things will be much different.

Asserting that the runs end with an expected mutation coverage feels like a good idea. Though we don't even need the clear-text table for that. You could also use Stryker's thresholds.break option.

That being said, we found that I had a lot more timeouts than you did in the Express bench. We might have to assert timeouts too. For that, you would need a reporter.
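A sketch of what such an assertion could look like once the numbers are available from a reporter. The shapes `RunOutcome` and `RunExpectation`, and the idea of a tolerance on the timeout count, are assumptions for illustration; how the numbers get out of Stryker is left open here.

```typescript
// Hypothetical shapes: how these numbers are obtained from a reporter is not shown.
interface RunOutcome {
  mutationScore: number; // percentage, 0 to 100
  timedOut: number; // number of mutants that timed out
}

interface RunExpectation {
  minimumScore: number;
  expectedTimedOut: number;
  timedOutTolerance: number; // allowed absolute difference in timed out mutants
}

// Returns a list of human-readable problems; an empty list means the run looks as expected.
function findProblems(outcome: RunOutcome, expectation: RunExpectation): string[] {
  const problems: string[] = [];
  if (outcome.mutationScore < expectation.minimumScore) {
    problems.push(
      `Mutation score ${outcome.mutationScore} is below ${expectation.minimumScore}`
    );
  }
  const drift = Math.abs(outcome.timedOut - expectation.expectedTimedOut);
  if (drift > expectation.timedOutTolerance) {
    problems.push(
      `Timed out count ${outcome.timedOut} differs from expected ${expectation.expectedTimedOut} by more than ${expectation.timedOutTolerance}`
    );
  }
  return problems;
}
```

A benchmark result would then only be recorded when `findProblems` returns an empty list, so a run full of timeouts does not end up as a data point.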

Continuous performance testing

We can start with the current pipeline. But I strongly suggest adding a trigger to every PR at some point. Performance is very important in Stryker so it should always be visible for everyone.

I think this should be the list:
...

If we drop down for the test runners (as above) we can condense this list:

  • JavaScript+Mocha: express
  • Angular & Karma: current webshop example (angular version needs to be updated to be compatible with Stryker 4)
  • TypeScript+Mocha: Stryker itself? (performance test an older version of Stryker for example).
  • Javascript/Typescript & Jest
  • Jasmine-runner (nodejs)
  • A react project: Maybe we can use react-bootstrap ?

This would give some redundancy in Typescript instrumentation. But Angular and React are both popular, so I see value in adding them both. This also gives us CLI and web projects. We also cover the most popular runners (Mocha, Karma, & Jest respectively). Finally, we should make sure we vary the benchmarks in size (mutation count & test count).

As you can see, in these high-level tests we start combining variables. This is why it can be difficult to trace issues you find in high-level tests.

@bartekleon
Member

bartekleon commented Nov 30, 2020

I think we should continue our talk here @Lakitna (#2618)
Since it's a good place for discussions about performance testing (as the title says :P), and it'll help us not to lose track [@nicojs could I also pin this issue or sth coz I believe this is the most important one for now].

But going back: Lakitna, I have prepared some scenarios for testing:
AllInOneFile - all functions are in 1 file and all tests in 1 file (pretty straightforward) [ 1 * t = s ]
UniformlyDistributed - basically you have n files with t functions, same for tests [ n * t = s ]
RandomlyDistributed - n files with r(n) functions [ r(0) + ... + r(n) = s ]

Code of example scenario:

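// Generate `size` functions in total, spread over files of `distribution` functions each.
const scenarioUniformlyDistributed = (size: number, distribution: number) => {
  for (let i = 0; i < size; i += distribution) {
    const filename = UID(); // same name is used for the source file and its test file
    saveToSrc(filename, createSourceFile(distribution));
    saveToTest(filename, createTestFile(distribution));
  }
};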

It can now be executed in GitHub Actions or another CI tool with different values to create source and test files, which can later be mutated by Stryker.
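For the RandomlyDistributed scenario, the part that differs from the code above is deciding how many functions each file gets so that the sizes still add up to s. A minimal sketch of one way to do that, assuming every file should get at least one function; `randomPartition` is a made-up name and this is not taken from the actual scenario generator.

```typescript
// Hypothetical helper: split `total` functions over `parts` files with random sizes.
// Every file gets at least one function and the sizes always sum to `total`.
function randomPartition(
  total: number,
  parts: number,
  random: () => number = Math.random
): number[] {
  const valid =
    Math.floor(total) === total && Math.floor(parts) === parts && parts >= 1 && total >= parts;
  if (!valid) {
    throw new Error('Need integers with 1 <= parts <= total');
  }
  // Pick parts - 1 distinct cut points between 1 and total - 1.
  const cuts: number[] = [];
  while (cuts.length < parts - 1) {
    const candidate = 1 + Math.floor(random() * (total - 1));
    if (cuts.indexOf(candidate) === -1) {
      cuts.push(candidate);
    }
  }
  cuts.sort((a, b) => a - b);
  // The gaps between consecutive cut points are the file sizes.
  const sizes: number[] = [];
  let previous = 0;
  for (const cut of cuts) {
    sizes.push(cut - previous);
    previous = cut;
  }
  sizes.push(total - previous);
  return sizes;
}
```

Passing in the `random` function makes it possible to use a seeded generator, so a scenario can be reproduced exactly between runs.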

The only problem I see for now is how to collect data. For now I only think of doing it by hand, I dunno other way, any suggestion / feedback is welcome.

@bartekleon
Member

bartekleon commented Nov 30, 2020

Ok, I think I have finally done it XD
https://github.com/kmdrGroch/Stryker-performance-testing
with current test running:
https://github.com/kmdrGroch/Stryker-performance-testing/runs/1476530021

(GitHub will hate me for using their VMs like this :D 16 runs GOOOO!)
(and if someone knows how to fix this issue with files it would be nice...)

@bartekleon
Member

bartekleon commented Dec 1, 2020

Results (first check, might need more points):
AllInOneFile
image

mutants | time (s)
-- | --
100 | 8
500 | 11
1000 | 20
500 | 255
10000 | 992
20000 | 3755

UniformlyDistributed (mutants amount - 10000)
image

mutants per file | time (s)
-- | --
200 | 1022
400 | 925
500 | 905
1000 | 850
2000 | 796

RandomlyDistributed (mutants amount - 10000)
image
(here i had to include 0,0 coz scatter plot generator said i need at least 5 but 5th test failed :) )

number of files | time (s)
-- | --
5 | 981
10 | 892
20 | 898
40 | 1149

Thoughts for now: I do need more tests - to get more dots AND (more important) check if it is stable or i just randomly got big numbers (see 20000 in AllInOneFile).
Update probably tomorrow :)

@Lakitna
Contributor Author

Lakitna commented Dec 1, 2020

mutants | time (s)
-- | --
100 | 8
500 | 11
1000 | 20
500 | 255
10000 | 992
20000 | 3755

I think the second 500 should be 5000?

Anyway, I did a little bit of Google Sheets magic for this. Should make the results easier to process. It's in the same document as the Express and Lighthouse bench. See the last tab here https://docs.google.com/spreadsheets/d/11dqDoxqbXVCQiBVtMq_eZpgMljTMPL-voI1MQdA-gDA/edit#gid=448418706

You can simply add new results on the left, and the three charts will update automatically.

image

I think we'll need more data points for any meaningful conclusions. Especially for the randomly distributed one.

@bartekleon
Member

bartekleon commented Dec 1, 2020

I think the second 500 should be 5000?

yea, I was doing it by hand :P

(also added more tests, waiting for them to complete now :) [will probably take around 6h 😅]), see: https://github.com/kmdrGroch/Stryker-performance-testing/runs/1479699022

@bartekleon
Member

bartekleon commented Dec 1, 2020

Ok, tests have been done:

AllInOneFile scenario:

image
image
image

Observations:

  • There are big jumps in some places, reaching even several times longer duration.
  • File having up to 5000 mutations seems the most reasonable to use
  • We do not handle a lot of mutations very well.
  • In the beginning there are ~100 mutants/s checked, but after a while it slows down (even to numbers like 30-40 when being near 20000)

Possible reasons:

  • Memory leaks
  • Some bad algorithm (O(n^2) and more)
  • Exceeding dedicated parts of memory, reallocations
  • Problems with reading / writing to big files / serialization / deserialization

Possible solutions:

  • Finding and fixing memory leaks
  • Changing algorithms

UniformlyDistributed

image
image
image

Observations:

  • It seems results don't depend on number of files that much
  • There might be performance drop while having lots of files

RandomlyDistributed

image
image
image

Observations:

  • Runs depend on the number of files, but only if there are a lot of them

General observations

  • Having more than 20000 mutants / 400 files will most likely result in over an hour runs

If you have any other observations / conclusions or ideas for other scenarios, please share them :)
@Lakitna @nicojs :)

@Lakitna
Contributor Author

Lakitna commented Dec 2, 2020

Ooohh there are some really interesting trends here! I'd like to have some more data points to even out the noise. Can you trigger the same runs again? I'll process the results this time 😉 Just let me know which time metric you've used.

AllInOneFile

  • There are big jumps in some places, reaching even several times longer duration.

The big jumps might just be noise. This is one of the reasons I'd like to run again. More data points should reduce the amount of noise.

  • File having up to 5000 mutations seems the most reasonable to use

One of the first things I did after looking at the graphs was to add a new graph:

image

Mutations per second is basically a metric of mutant processing speed. It's an interesting one, but I think we don't have enough data points between 100 and 10000 mutations. It looks to be a reverse exponential relation though. Which makes sense.

  • We do not handle a lot of mutations very well.

Indeed, we do not. I mentioned before that I suspect RAM to be the bottleneck. Do we want to validate that?

  • In the beginning there are ~100 mutants/s checked, but after a while it slows down (even to numbers like 30-40 when being near 20000)

It's actually less than that, the fastest is 1000 mutants at 52.63 mutants/s. The slowest is 40000 mutants at 2.19 mutants/s. A very significant difference. Should we investigate splitting mutants for big files (e.g. bundles)?
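The mutants per second metric is simply the mutant count divided by the run duration. A small sketch, using two rows from the first AllInOneFile table posted earlier in this thread as example input (those are from the first run, so the values differ from the ones quoted just above). The `Measurement` shape is an assumption for illustration.

```typescript
// Illustrative shape for one benchmark result.
interface Measurement {
  mutants: number;
  seconds: number;
}

// Throughput: how many mutants were processed per second of total runtime.
function mutantsPerSecond(measurement: Measurement): number {
  if (measurement.seconds <= 0) {
    throw new Error('Duration must be positive');
  }
  return measurement.mutants / measurement.seconds;
}

// Two rows from the first AllInOneFile table in this thread.
const firstRun: Measurement[] = [
  { mutants: 1000, seconds: 20 },
  { mutants: 10000, seconds: 992 },
];

for (const row of firstRun) {
  console.log(`${row.mutants} mutants: ${mutantsPerSecond(row).toFixed(2)} mutants/s`);
}
// prints "1000 mutants: 50.00 mutants/s" and "10000 mutants: 10.08 mutants/s"
```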

Possible solutions:

  • Finding and fixing memory leaks
  • Changing algorithms

Probably a good starting point. It would be great if we can flatten the curve in the first graph. Though I think we can't make it linear due to hardware limitations. There will always be some bottleneck!

UniformlyDistributed

I really don't know about this one. The results feel very noisy, maybe more data will help. At this point, there are not a lot of conclusions we can make but one: More files = slower. Which makes sense, I/O takes time.

@kmdrGroch I've added a chart, can you please check if I did things correctly?

RandomlyDistributed

This looks to be a linear relationship (when files >= 10). That's awesome! I would like some more data to prove it though.

We should be able to combine these results with the results of AllInOneFile.

When combining this with the AllInOneFile tests we can almost certainly conclude that Stryker scales exponentially with the size of the codebase.

@kmdrGroch I've added a chart, can you please check if I did things correctly?

@bartekleon
Member

bartekleon commented Dec 2, 2020

Honestly I don't get Sheet3 and Sheet4. Do you have some time on Friday so we could have a meeting about these? I could also provide more info about the package I have made and give you access there (we could add more scenarios then ;). It would also be nice to somehow process data automatically. Maybe some scraper with Python? (GitHub will hate me for running 24h tests and then scraping them hahahaha 😅)

I'd like to have some more data points to even out the noise. Can you trigger the same runs again? I'll process the results this time 😉 Just let me know which time metric you've used.

Yop I can add more data points. But it will take most likely about 10-20h to run 😅 (last one ran for 8h)
About metrics: basically the numbers in test cases correspond to the number of functions (or files), and each function produces 2 mutants.

build_and_test (14.x, all 50) - 1 file - 100 mutants. Basically number x 2 = mutants (with this function. We can change function tho)
build_and_test (14.x, uniform 6000 50) - 12000 mutants total and 100 mutants per file (also 2x - its basically 6000 functions total and 50 functions per file)
build_and_test (14.x, random 6000 5) - 12000 mutants total and 5 files (here the second number means number of files. Do not multiply it by 2. Number of mutants there is always different which could potentially create more noise, but its basically what we want there :P)
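Since the x2 is easy to forget when processing results, here is a sketch that decodes the part of the job name after the Node version into function, mutant, and file counts. It follows the three examples above; the `JobInfo` shape and the function name are made up for illustration, and 2 mutants per function only holds for the function currently used in the scenarios.

```typescript
// Each generated function currently produces 2 mutants (this changes if the function changes).
const MUTANTS_PER_FUNCTION = 2;

// Illustrative shape; not part of the performance testing repository.
interface JobInfo {
  scenario: 'all' | 'uniform' | 'random';
  functions: number;
  mutants: number;
  files: number;
}

// Decodes labels like "all 50", "uniform 6000 50" and "random 6000 5".
function decodeJob(label: string): JobInfo {
  const parts = label.trim().split(/\s+/);
  const scenario = parts[0];
  const values = parts.slice(1).map(Number);
  if (values.length === 0 || values.some((value) => isNaN(value))) {
    throw new Error(`Cannot read numbers from "${label}"`);
  }
  const functions = values[0];
  const mutants = functions * MUTANTS_PER_FUNCTION;
  switch (scenario) {
    case 'all':
      // Everything lives in a single file.
      return { scenario: 'all', functions, mutants, files: 1 };
    case 'uniform':
      // Second number is the number of functions per file.
      if (values.length < 2 || values[1] <= 0) {
        throw new Error('uniform needs a positive functions-per-file value');
      }
      return { scenario: 'uniform', functions, mutants, files: Math.ceil(functions / values[1]) };
    case 'random':
      // Second number is the number of files; it is not multiplied by 2.
      if (values.length < 2) {
        throw new Error('random needs a file count');
      }
      return { scenario: 'random', functions, mutants, files: values[1] };
    default:
      throw new Error(`Unknown scenario: ${scenario}`);
  }
}
```

For example, `decodeJob('uniform 6000 50')` gives 12000 mutants spread over 120 files, which is 100 mutants per file.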

and about time i take: I take time from "Build & test" - i open it, scroll down to the bottom and read numbers from there
image

image

Also important note: If you want to clone and contribute - remember to cancel previous jobs - otherwise you will wait years for them to complete (for now we have 39 tasks and we want to add more [probably around 30 per scenario :P])

@bartekleon
Member

https://github.com/kmdrGroch/Stryker-performance-testing/runs/1487159798
@Lakitna it's running. We will now wait for a lil bit (84 tests I believe :D). I think in the meantime I'll write a scraper in Python to get data faster :D

@Lakitna
Contributor Author

Lakitna commented Dec 2, 2020

Honestly I dont get Sheet3 and Sheet4

Haha, I don't blame you, I was trying some stuff in Sheet3. I was trying to estimate the sweet spot for mutations per file. Didn't work very well though o.0

I used Sheet4 to get statistics on mutants per file in the Lighthouse bench. The bench has an average of 54.8 mutants per file.

@bartekleon
Member

bartekleon commented Dec 2, 2020

I also started working on this scraper but I somehow can't get the results :P I am stuck on opening tests and getting the time 😓 but at least I get all cases for now and open them (so I'm VERY CLOSE to automating it; it takes ~10 * number of tests seconds, so in 13 minutes you can get all tests :D).

Update: I don't think it is possible to scrape these values from GitHub :c I guess to automate this we would need to either get better knowledge of their API or use another type of script - they don't allow me to open the "build_and_test" tab :/

@bartekleon
Member

bartekleon commented Dec 2, 2020

@nicojs do we store execution time somewhere? Maybe i could create artifacts and get data from there 🤔

Ok I think I managed to create it. https://github.com/kmdrGroch/Stryker-performance-testing/actions/runs/396555042
But I guess we will see at the end of the runs 😅 (it will be sad if it is not generating :) )

@Lakitna
Contributor Author

Lakitna commented Dec 2, 2020

Isn't there a csv or json reporter? I seem to remember such a thing existing.

@bartekleon
Member

bartekleon commented Dec 2, 2020 via email

@Lakitna
Contributor Author

Lakitna commented Dec 2, 2020

I checked. There is a json reporter, but it doesn't expose runtime.

@bartekleon
Member

bartekleon commented Dec 3, 2020

data can be also collected from artifacts now :) : https://github.com/kmdrGroch/Stryker-performance-testing/actions/runs/396555042

I have made a simple tool for extracting the data and doing a lil bit of handling of it

package.zip
^ data i have collected and script for cleanups

already cleaned up data:
data.txt

@Lakitna
Contributor Author

Lakitna commented Dec 3, 2020

Awesome, that makes processing the data sooo much easier!

Did you change anything in any of the scenarios between this and the previous runs? There are two distinct trends right now. The faster-growing one is the latest run.

image

@bartekleon
Member

bartekleon commented Dec 3, 2020

i am scared that github uses different VMs or that runs depend on number of runs etc.
I can run tests again tho :D
also, did you remember about multiplying by 2?
ALL 6000 is actually 12000 mutants 🤔 @Lakitna

@Lakitna
Contributor Author

Lakitna commented Dec 3, 2020

Yeah, I don't see any differences in the other things. Does the x2 also apply to AllInOneFile? If so, thats probably the issue here.

Edit: It looks like that solved it :)

@Lakitna
Contributor Author

Lakitna commented Dec 3, 2020

All in one file

image

Just look at those tight trendlines! :) I think it's pretty safe to say that there is some sort of exponential growth going on here. I do wonder about the 20000-30000 range though.

The last graph is also pretty neat I think. You can clearly see that below 1000 mutants file operations are the bottleneck, while above 1000 mutants the mutation testing itself is.

I would like to find the formula for the trend line here. I think we could use that for some interesting stuff. I've tried, but have not yet succeeded.

Uniformly distributed

image

Look at that, we can clearly see a linear relation between file count and execution time (when mutation count remains the same). Both graphs actually tell us the same thing as file count and mutations per file have an exponential relation. This result is underlined by the next one too.

Randomly distributed

image

Again, the linear relation between file count and execution time. This is basically the exact same result as the uniformly distributed one. Which makes sense to me. File operations are the deciding factor here.

Conclusion

I think it's safe to say that the relation between file count and runtime is a linear one. That is awesome. For the next step we could see if we can make a general definition for this relation. Only if we find it useful though, I'm not sure yet if I think that to be useful.
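To back a claim like "linear" with a number, an ordinary least-squares line fit with its R² value can be computed from the (file count, duration) pairs. A minimal sketch; `fitLine` and the `LinearFit` shape are names made up for this example, and it is not how the charts in the sheet were produced.

```typescript
// Illustrative result of a straight-line fit y = slope * x + intercept.
interface LinearFit {
  slope: number;
  intercept: number;
  rSquared: number; // 1 means the points lie exactly on the line
}

// Ordinary least-squares fit of a line through the points (xs[i], ys[i]).
function fitLine(xs: number[], ys: number[]): LinearFit {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error('Need at least two points and equally long inputs');
  }
  const n = xs.length;
  let sumX = 0;
  let sumY = 0;
  for (let i = 0; i < n; i++) {
    sumX += xs[i];
    sumY += ys[i];
  }
  const meanX = sumX / n;
  const meanY = sumY / n;
  let sxx = 0;
  let sxy = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    sxx += dx * dx;
    sxy += dx * dy;
    syy += dy * dy;
  }
  if (sxx === 0) {
    throw new Error('x values must not all be equal');
  }
  const slope = sxy / sxx;
  const intercept = meanY - slope * meanX;
  // Sum of squared residuals between the measurements and the fitted line.
  let residual = 0;
  for (let i = 0; i < n; i++) {
    const error = ys[i] - (slope * xs[i] + intercept);
    residual += error * error;
  }
  const rSquared = syy === 0 ? 1 : 1 - residual / syy;
  return { slope, intercept, rSquared };
}
```

The slope then reads as "extra seconds per extra file", and an R² close to 1 supports the linear reading. A low R² would mean the data is too noisy for that conclusion.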

The biggest realisation to me here is that there should be no huge performance difference between a source code bundle and source code files. Just a linear increase based on the file count. We might want to validate this and include this tip for large codebases in the docs.

The relation between mutation count and runtime is an exponential one. If we want to improve performance we should focus our efforts on this relation. This also highlights why random mutation sampling #2584 is so much faster.

@bartekleon
Member

bartekleon commented Dec 3, 2020

I do wonder about the 20000-30000 range though.

I had test cases for these too, but GitHub has a max of 6 hours before a job times out, and all of these timed out 😅
Will push fix today and we should get all tests this time :)

Only if we find it useful though, I'm not sure yet if I think that to be useful.

I think it would be, if we knew +- the correlations and speed, we could better estimate the time for tests to end ;) e.g. my bigmath app gets info that it would be tested in e.g. 4-5 minutes in the beginning but ends up with 7-8 coz of drops etc :) we could give better time info + we find bottlenecks, and room to improve, test again, and check if correlations are still the same ;)

What bothers me, is that we have exponential growth for mutants... I wonder why is that, and thinking how could we make it more of polynomial. We would somehow need to find bottleneck of it. Maybe some n^3 functions on top of mutants. We should also check all "filter", "map" and other functions on mutants, since they create new arrays which could increase ram usage, add some unnecessary stuff... especially if you run filter.filter.map instead of doing everything in one loop.
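To illustrate the filter.filter.map point: both functions below give the same result, but the chained version allocates two intermediate arrays while the loop allocates only the result. The `Mutant` shape here is a simplified stand-in, not Stryker's actual type, and whether this pattern matters in Stryker is unverified. Note that chaining is still linear in the number of mutants, so on its own it would add memory pressure rather than explain faster-than-linear growth.

```typescript
// Simplified stand-in for a mutant; not Stryker's real data structure.
interface Mutant {
  id: number;
  fileName: string;
  status: 'killed' | 'survived' | 'timeout';
}

// Chained version: each filter call creates a new intermediate array.
function survivedIdsChained(mutants: Mutant[], fileName: string): number[] {
  return mutants
    .filter((mutant) => mutant.fileName === fileName)
    .filter((mutant) => mutant.status === 'survived')
    .map((mutant) => mutant.id);
}

// Single-pass version: one loop, only the result array is allocated.
function survivedIdsSinglePass(mutants: Mutant[], fileName: string): number[] {
  const ids: number[] = [];
  for (const mutant of mutants) {
    if (mutant.fileName === fileName && mutant.status === 'survived') {
      ids.push(mutant.id);
    }
  }
  return ids;
}
```

Profiling a large run would show whether allocations like these show up at all before anything is rewritten.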

If you are able to run ~20000 mutants in a performance test to see what uses how much processing, it would probably help us. I think using simple functions like I used here would help us more, giving cleaner results than using "random" ones (like express). It might also be that the bottleneck is Babel/test tools. That would be a pain then :)

@Lakitna
Contributor Author

Lakitna commented Dec 3, 2020

I reached the end of Sheets capabilities, so I stepped up to a more powerful program. You can see everything here in glorious interactive mode and such!

https://public.tableau.com/profile/lakitna#!/vizhome/Strykerp/Dashboard

The data has also been moved. I needed a cleaner data source. I used a new Sheets for that: https://docs.google.com/spreadsheets/d/1EhMUTFGiXkRK7giGinK6J_yzkjnyTmxVSl4u9OTDTK4/edit?usp=sharing

Big improvements are in the ease of creating visualisations and trend lines. It also includes uncertainty bands on trend lines. It also allows us to select only a few runs, and see what it does on the other graphs. It's awesome! I have a lot more faith in these visualisations as there is a lot less room for errors.

I did find out that the mutation count isn't quite exponential, but it's definitely the most important factor.

Let me know what you think :)

@Lakitna
Contributor Author

Lakitna commented Dec 3, 2020

If you are able to run ~20000 mutants in a performance test to see what uses how much processing, it would probably help us. I think using simple functions like I used here would help us more, giving cleaner results than using "random" ones (like express). It might also be that the bottleneck is Babel/test tools. That would be a pain then :)

Yeah, I can do that. We won't be able to use the data point in the statistics, but it might help with tracking down what's causing the relationship to be exponential-ish.

I might be able to do this tonight, but no promises!

Edit: Thinking about predicting better runtimes. Can you include the duration of the initial test run in the scraper? Initial test run duration + mutation count + file count might be enough to estimate very closely. If we can find the relation between those three we might be able to formulate some solid advice for users. Something like "Expected run to take 02:21:56. It looks like the bottleneck is the number of source files. Bundling your source code before testing could save you about 00:32:03".

It might be overkill though 👼

@bartekleon
Member

bartekleon commented Dec 3, 2020

Yea, I was also checking our results, and indeed I found that the growth is more likely to be polynomial (much better R^2 with our graphs), tho unfortunately in both cases these equations are soo ugly
image

Can you include the duration of the initial test run in the scraper?

Yeaa I believe so, i could try adding it tomorrow

I think it was mentioned in some other issue long time ago, but I was also thinking and i think it would be nice, if we could add anonymous reporting - PC spec / number of files / number of mutations / time / initial time.
With more data we could also get more information how this works IRL, not only in our "lab env", but we would need @nicojs permission :)

@bartekleon
Member

@Lakitna I think no chance for 50000/60000 mutants... actions keeps cancelling them :c
https://github.com/kmdrGroch/Stryker-performance-testing/actions/runs/398169972

@Lakitna
Contributor Author

Lakitna commented Dec 3, 2020

I've actually tried to make a 2d prediction just now:

Mutation trend line:

duration(mutations) = (0.0000098942*mutations^2 + 0.00435359*mutations - 5.71146) + 5.79557

Files (random) trend line:

duration(files) = 5.79557 * files + 1254.74

I basically took the trendline for the mutation count and added the slope of the file trend to it. The -1 is there because the mutation trend was measured at files = 1.

duration(mutations, files) = (0.0000098942*mutations^2 + 0.00435359*mutations - 5.71146) + 5.79557 * (files - 1)

Let's throw some maths at it to simplify it, with a = mutations and b = files:

t(a, b) = 9.8942e-6*a^2 + 4.35359e-3*a + 5.79557*b - 11.50703

I've even plotted the function: image

According to Wolfram Alpha it's a parabolic cylinder if anyone is interested.
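As a rough sketch, the simplified trendline can be evaluated like this. It assumes the first argument is the mutation count, the second the file count, and the result is in seconds. The names `estimateDurationSeconds` and `formatDuration` are made up for illustration, and the coefficients only describe the runs discussed here, so they won't transfer to other machines or projects.

```typescript
// Fitted trendline from this thread: quadratic in mutations, linear in files.
// Coefficients are specific to the measured runs, not a general model.
function estimateDurationSeconds(mutations: number, files: number): number {
  return (
    9.8942e-6 * mutations * mutations +
    4.35359e-3 * mutations +
    5.79557 * files -
    11.50703
  );
}

// Hypothetical helper: format a duration in seconds as hh:mm:ss.
function formatDuration(totalSeconds: number): string {
  const s = Math.max(0, Math.floor(totalSeconds));
  const pad = (n: number): string => (n < 10 ? "0" : "") + n;
  const hours = Math.floor(s / 3600);
  const minutes = Math.floor((s % 3600) / 60);
  return `${pad(hours)}:${pad(minutes)}:${pad(s % 60)}`;
}

// Example: projected duration for 12000 mutants in a single file.
console.log(formatDuration(estimateDurationSeconds(12000, 1)));
```

Keeping the estimate and the formatting separate makes it easy to swap in a better model later without touching the output.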

@Lakitna
Contributor Author

Lakitna commented Dec 3, 2020

@kmdrGroch I don't think it matters too much. The trendlines fit very well; I would be amazed if anything weird happened at higher numbers.

I might be able to do it on my machine though, I'll take a look while I'm at it.

@bartekleon
Member

bartekleon commented Dec 3, 2020

> I might be able to do it on my machine though, I'll take a look while I'm at it.

Then you'd need to rerun all the other tests there as well :/
Otherwise the results aren't comparable, since your machine is more powerful than these VMs.

> duration(mutations, files) = (0.0000098942*mutations^2 + 0.00435359*mutations - 5.71146) + 5.79557 * (files - 1)

Interesting O.O Is the result in minutes or seconds?

> t(a, b) = 9.89e-5*a^2 + 4.35e-4*a + 5.79557*b - 11.59114

I tried evaluating t(40000, 1), but it doesn't match our results 🤔

@Lakitna
Contributor Author

Lakitna commented Dec 3, 2020

It outputs seconds. I've tried it with a bunch of combinations. But now that I think about it, I didn't check > 12000

@Lakitna
Contributor Author

Lakitna commented Dec 3, 2020

Oh wow, I really messed up the formula rewrite o.0 Try this instead :)

t(a, b) = 9.8942e-6*a^2 + 4.35359e-3*a + 5.79557*b - 11.50703

@Lakitna
Contributor Author

Lakitna commented Dec 4, 2020

I decided to do a full sweep of the AllInOneFile scenario on my PC. The results are in the second tab here: https://docs.google.com/spreadsheets/d/1EhMUTFGiXkRK7giGinK6J_yzkjnyTmxVSl4u9OTDTK4/edit#gid=232513790

The trendline currently has R^2 = 0.9998 and P < 0.0001. That's ridiculously tight! The formula is, as expected, significantly different from the previous one. This again shows the non-linear relation between mutation count and duration, this time on a more powerful system.
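For anyone who wants to check a fit outside the spreadsheet, here is a minimal sketch of the R^2 calculation. It assumes the trendline reports the usual coefficient of determination, 1 - SS_res / SS_tot. The function name `rSquared` and the sample numbers are made up for illustration and are not measurements from these runs.

```typescript
// Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
// Note: returns NaN or -Infinity when all actual values are identical (SS_tot = 0).
function rSquared(actual: number[], predicted: number[]): number {
  if (actual.length === 0 || actual.length !== predicted.length) {
    throw new Error("inputs must be non-empty and of equal length");
  }
  const mean = actual.reduce((sum, y) => sum + y, 0) / actual.length;
  let ssRes = 0;
  let ssTot = 0;
  for (let i = 0; i < actual.length; i++) {
    ssRes += (actual[i] - predicted[i]) * (actual[i] - predicted[i]);
    ssTot += (actual[i] - mean) * (actual[i] - mean);
  }
  return 1 - ssRes / ssTot;
}

// Made-up numbers, only to show the call shape.
const sampleActual = [1, 2, 3, 4];
const samplePredicted = [1.1, 1.9, 3.2, 3.8];
console.log(rSquared(sampleActual, samplePredicted).toFixed(4));
```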

During the runs, I ran between 50 and 100000 mutations in various increments. It failed at 60000 due to JavaScript heap out of memory. I'm currently running 52500, 55000, and 57500 to try to narrow down at which point it starts to fail. I suspect RAM to be the issue, but I want to validate that it's a hardware limitation. This will take all day to run. The 50000 run took 03:31:40.

@bartekleon
Member

bartekleon commented Dec 4, 2020

> t(a, b) = 9.8942e-6*a^2 + 4.35359e-3*a + 5.79557*b - 11.50703

This is still not accurate... I believe there might be an a*b term somewhere, or even a^2*b and a*b^2 (see the uniform scenario: always 12000 mutants and more files, so it should be growing all the time, but instead it looks like noise). I guess finding these is not that simple, if not impossible.

@Lakitna
Contributor Author

Lakitna commented Dec 4, 2020

Meh, you're right. I didn't check my work well enough, it seems...

image image

The % difference between projected and actual durations, plotted against the actual durations. All values should hover around a diff of 0%.
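A minimal sketch of how such a diff can be computed, assuming it is measured relative to the actual duration, so positive values mean the projection overshoots. The sign convention and the name `percentDiff` are my assumptions here.

```typescript
// Percentage difference of a projection relative to the measured value.
// Positive: projection is too high. Negative: projection is too low.
function percentDiff(projected: number, actual: number): number {
  if (actual === 0) {
    throw new Error("actual duration must be non-zero");
  }
  return ((projected - actual) / actual) * 100;
}

// Example call with placeholder numbers (not real measurements).
console.log(percentDiff(110, 100).toFixed(1) + "%");
```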

@Lakitna
Contributor Author

Lakitna commented Dec 4, 2020

So I looked a bit deeper to see if I can find out why my projection is so bad.

Plotted here is files (x) vs runtime (y). Can you think of any reason why these lines are so different? Did I mess up some data somewhere, or is uniform (orange square) just that much slower? I expect these trends to be somewhat similar, though random (red plus) should be a bit more erratic.

image

@bartekleon
Member

bartekleon commented Dec 4, 2020

I feel like you multiplied something again... I should make my results clearer, I suppose...
All -> multiply by 2 -> number of mutants
Uniform -> multiply both by 2 -> number of mutants / number of mutants per file
Random -> multiply only the first by 2 -> number of mutants / number of files

build_and_test (14.x, all 50) - 1 file - 100 mutants
build_and_test (14.x, uniform 6000 50) - 12000 mutants total and 100 mutants per file
build_and_test (14.x, random 6000 5) - 12000 mutants and 5 files
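The mapping above can be sketched as a small parser for the arguments in the job label. The names `parseJobArgs` and `RunShape` are hypothetical, and the file count for the uniform scenario is derived here as mutants / mutants per file, which is my reading of the examples rather than something the workflow reports.

```typescript
type Scenario = "all" | "uniform" | "random";

interface RunShape {
  scenario: Scenario;
  mutants: number;
  files: number;
  mutantsPerFile?: number; // only fixed for "all" and "uniform"
}

// Parse the scenario part of a job label, e.g. "uniform 6000 50".
function parseJobArgs(args: string): RunShape {
  const [scenario, first, second] = args.trim().split(/\s+/);
  const a = Number(first);
  const b = Number(second);
  switch (scenario) {
    case "all":
      // One file; the single argument is doubled to get the mutant count.
      return { scenario, mutants: a * 2, files: 1, mutantsPerFile: a * 2 };
    case "uniform": {
      // Both arguments are doubled: total mutants and mutants per file.
      const perFile = b * 2;
      return { scenario, mutants: a * 2, files: (a * 2) / perFile, mutantsPerFile: perFile };
    }
    case "random":
      // Only the first argument is doubled; the second is the file count.
      return { scenario, mutants: a * 2, files: b };
    default:
      throw new Error(`unknown scenario: ${args}`);
  }
}

console.log(parseJobArgs("uniform 6000 50"));
```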

@Lakitna maybe you'd like to have a meeting during the weekend? We could talk it through instead of writing. I believe it would be much easier to reach conclusions and resolve some ambiguities.

@Lakitna
Contributor Author

Lakitna commented Dec 7, 2020

I tend to need my downtime during the weekend, but I can meet today? I see that you are a member of the Stryker Slack, we can use that to call. How about 11:00+01:00? I can also do 14:00+01:00 and anywhere after 16:00+01:00.

@Lakitna
Contributor Author

Lakitna commented Dec 7, 2020

So it looks like the reporting of the runs is tripping me up again. It starts to feel like a good idea to introduce a reporter for performance testing. One that outputs JSON. Something like:

{
    "timestamp": "2020-12-07T11:56Z", // optional
    "score": 75.7,
    "testPerMutant": 2.3,
    "duration": {
        "total": 1234,
        "initialRun": 1,
        "mutation": 1233
    },
    "mutants": {
        "total": 12000,
        "killed": 500,
        "survived": 500,
        "error": 100,
        "timeout": 100,
        "noCoverage": 0,
        "ignored": 0
    },
    "files": {
        "mutated": [
            {
                "path": "mutated/file/path.js",
                "score": 12.32,
                "testPerMutant": 2.3,  // I don't think we have this data point?
                "duration": 123, // I don't think we have this data point?
                "mutants": {
                    "total": 12,
                    "killed": 5,
                    "survived": 5,
                    "error": 1,
                    "timeout": 1,
                    "noCoverage": 0,
                    "ignored": 0
                }
            },
            {...}
        ],
        "ignored": [ // optional
            {
                "path": "not/mutated/file/path.txt"
            },
            {...}
        ]
    },
    "stryker": {
        "config": { // Parsed Stryker config
            "testRunner": "mocha",
            "coverageAnalysis": "perTest",
            "concurrency": 15,
            ...
        },
        "version": "0.0.1"
    }
}

Basically, a reporter that outputs all kinds of interesting numbers we can use for performance/statistical analysis. As far as I'm concerned, it can be a separate NPM package.

@nicojs @kmdrGroch What do you think?

@bartekleon
Member

bartekleon commented Dec 7, 2020

> I tend to need my downtime during the weekend, but I can meet today? I see that you are a member of the Stryker Slack, we can use that to call. How about 11:00+01:00? I can also do 14:00+01:00 and anywhere after 16:00+01:00.

That's OK 😆 I believe we could meet at 4pm :) (+1)

> Basically, a reporter that outputs all kinds of interesting numbers we can use for performance/statistical analysis. As far as I'm concerned, it can be a separate NPM package.

I would love it O.O

I feel like, if we were also to remove the files property and add the PC spec, we could even consider sending users' results in this format, which we could analyze later on :)

Edit: Unfortunately, I won't be able to meet today; I have an important meeting :/

@Lakitna
Contributor Author

Lakitna commented Dec 7, 2020

> I feel like, if we were also to remove files property and add PC spec, we could even consider sending users results in this format, which we could later on analyze :)

That could be great. I feel like we would only have to remove the path fields, right? The per-file statistics still apply when anonymised. Maybe hash the path to enable run-to-run comparison at some point in the future?

See updated output below. I used a quick MD5 hash for the file ID, which is not secure enough, I know. For system things I used Node's os module. Though we could also use something like https://www.npmjs.com/package/systeminformation if we want to drown in data :)

{
    "timestamp": "2020-12-07T11:56Z", // optional
    "score": 75.7,
    "testPerMutant": 2.3,
    "tests": 2,
    "duration": {
        "total": 1234,
        "initialRun": 1,
        "mutation": 1233
    },
    "mutants": {
        "total": 12000,
        "killed": 500,
        "survived": 500,
        "error": 100,
        "timeout": 100,
        "noCoverage": 0,
        "ignored": 0
    },
    "files": {
        "mutated": [
            {
                "id": "d5886a034ba69e2ebdccb36e71cd6416",
                "score": 12.32,
                "testPerMutant": 2.3,  // I don't think we have this data point?
                "duration": 123, // I don't think we have this data point?
                "mutants": {
                    "total": 12,
                    "killed": 5,
                    "survived": 5,
                    "error": 1,
                    "timeout": 1,
                    "noCoverage": 0,
                    "ignored": 0
                }
            },
            {...}
        ],
        "ignored": [ // optional
            {
                "id": "c03b77207e30c3e17df1dd604cb59e05" // Kind of useless now :)
            },
            {...}
        ]
    },
    "stryker": {
        "version": "0.0.1",
        "config": { // Parsed Stryker config
            "testRunner": "mocha",
            "coverageAnalysis": "perTest",
            "concurrency": 15,
            ...
        }
    },
    "system": {
        "ci": false,
        "os": {
            "version": "Windows 10 Pro",
            "platform": "win32",
            "release": "10.0.19041"
        },
        "cpu": {
            "model": "Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz",
            "logicalCores": 16,
            "baseClock": 2304
        },
        "ram": {
            "size": 34.07,
            ...
        }
    }
}
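A minimal sketch of how the `id` and `system` fields could be produced with only Node built-ins. It swaps the quick MD5 for a salted SHA-256, which is my assumption and not a decision made in this thread. The names `anonymiseFilePath` and `collectSystemInfo` are hypothetical, and reporting RAM in decimal gigabytes is a guess at the unit in the example above.

```typescript
import { createHash } from "crypto";
import * as os from "os";

// Replace a file path with a stable ID so runs can be compared without
// exposing the path. The per-project salt makes guessing common paths harder.
function anonymiseFilePath(path: string, salt: string): string {
  return createHash("sha256").update(salt).update(path).digest("hex");
}

// Collect a subset of the "system" section using Node's os module.
function collectSystemInfo() {
  const cpus = os.cpus(); // may be empty on some restricted environments
  return {
    os: { platform: os.platform(), release: os.release() },
    cpu: {
      model: cpus.length > 0 ? cpus[0].model : "unknown",
      logicalCores: cpus.length,
      baseClock: cpus.length > 0 ? cpus[0].speed : 0, // MHz, as reported by the OS
    },
    ram: { size: Number((os.totalmem() / 1e9).toFixed(2)) }, // decimal GB (assumed unit)
  };
}

console.log(anonymiseFilePath("mutated/file/path.js", "example-salt"));
console.log(collectSystemInfo());
```

The same path and salt always give the same ID, which keeps run-to-run comparison possible, while a different salt per project keeps IDs from being matched across projects.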

@Lakitna
Contributor Author

Lakitna commented Dec 9, 2020

I almost forgot to write an update on this: The failing test runs with ~60000 mutants.

It turns out that this is due to the heap size limit during the TypeScript transpilation of the source file. Node processes have a default heap size limit that will be exceeded by tsc when the number of mutants is too high. Running tsc outside of Stryker has the same behaviour.

This makes a lot of sense to me. I was able to fix the test by providing the --max-old-space-size flag to increase the max heap size.

Or, in other words, these tests fail due to a Node limitation, not a Stryker limitation. We can safely ignore this.
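To see which limit a Node process is actually running with, for example after starting it with something like `node --max-old-space-size=8192` (the value is in MiB and 8192 is only an arbitrary example), a small check like the one below can help. The function name `heapLimitMiB` is made up, and the reported number is V8's overall heap limit, so it will only roughly track the flag value.

```typescript
import * as v8 from "v8";

// Current V8 heap size limit of this process, in MiB.
function heapLimitMiB(): number {
  return v8.getHeapStatistics().heap_size_limit / (1024 * 1024);
}

console.log(`heap limit: ${heapLimitMiB().toFixed(0)} MiB`);
```

Running this with and without the flag shows whether the setting reached the process that does the transpiling.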

@bartekleon
Member

@Lakitna so yeah... I have totally forgotten about the performance testing app :D It has already run like 50 times https://github.com/kmdrGroch/Stryker-performance-testing/actions
You can get a lot of data from there XDD

@stale

stale bot commented Jan 11, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the ☠ stale Marked as stale by the stale bot, will be removed after a certain time. label Jan 11, 2022
@stale stale bot closed this as completed Feb 11, 2022