Skip to content

Attempting to match C++/SFML performance with JS/Chrome

Notifications You must be signed in to change notification settings

Jumbub/game-of-life-js

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

deploy

Investigating JavaScript optimization techniques while building Game of Life.


Results with comparisson to a fundamentally equivilant C++ implementation.

Benchmark JS C++
per cell avg 0.34ns 0.15ns
total 2.55s 1.14s

Obviously hardware dependent, but relatively speaking, JS ain't bad

Both apps use the same fundamental strategies (compare next.ts with next.cpp)


Getting started

npm install

npm run build

npm run serve

Run demo locally, or live

Run benchmark locally, or live

Run tests locally, or live


Log of benchmark improvements

Benchmark hardware:

  • i5-7600K @ 4.4GHz
  • GTX 1080

Benchmark scenario:

  • The benchmark is "time to compute 2000 generations; on a wrapping 2560x1440 sized board"
  • The initial board is always the same, and contains
    • A breeder on the top half
    • Seeded random cells in the bottom left
    • Chess grid of 8x8 alive/dead cells in the bottom right

First working benchmark (75s)

Naive and simple manual transpilation of the C++ implementation to Typescript, without any advanced Javascript (typed arrays, workers, etc).

c15cae4d00beb59f5e6e50f495a323eca3f24eb5

Use smaller typed arrays for skip cells (53s)

First attempt at using typed arrays, using Uint8Arrays to store the skip data.

01c5c850ccef4075d01441983a55cae6aa127c2a

Use smaller typed arrays for data cells (37s)

Use Uint8Array's for storing state of a cell, then create the Uint32Array required for rendering at render time.

This benchmark also adds a requirement that the renders per second never drops below 29.97.

1263580a75a6ac861616411fddc1a9605e995292

Moving the computation into a worker (28s)

Create a single worker which listens for computation requests from the primary thread.

13ce69fd81bf3b28271993629645bef6896d78b4

Move the computation loop into a dedicated worker (19s)

Create a dedicated worker for running the loop.

Allowing the computation scheduling to run synchronously (removed setTimeout(job, 0)).

e67e46a797115aca4bd168210c0201507baa9c41

Mutiple workers computing simultaneously (12s)

Currently most optimal with 16 workers, but I suspect that number will come back down* to 4 once I've optimised more operations.

4 threads: 12.70s 8 threads: 11.74s 16 threads: 11.45s

97067de019fc6045240d157da1ca130c77dd27dd

*it did come back down; performance is optimal with 4 workers now

Mutiple workers computing simultaneously (11.8s)

Wait for workers to mark themselves as "ready" before starting computations.

e19df8b319c81b59b05933bd3a60cf74d3b0b9fb

Using Atomics to communicate, rather than messages (13.5s)

The improvement is quite small, but is consistently faster than messaging.

For some reason I thought this was running faster than the previous commit, but when re-running these benchmarks it is clearly much slower.

301c1fb959d2d438fc6d6c8dfe0111e11821b39c

Re-use ImageData between frames (12s)

Unecessary memory allocation every render

79ff8a0d170bc0e54ef6ce16f08267560fbf1180

Blocking while loop in worker (9.2s)

Replace timers with blocking while loop

1f72fc965fcce195282dc0b99bc171401eb7f877

Skip faster (8.3s)

Increase skip multiplyer from 2 to 8

ce1c2e532af3341120efb258419cae4cde97a198

Mess of changes un-tidily committed (6.4s)

Don't bother calling requestAnimationFrame, more atomic communication, modifying thread & job counts, remove sleeps, do work on primary thread.

Multi threaded array fill (5s)

Perform the .fill operation simultaneously on multiple threads for the skip array.

e3d355ea2669daf6adfad30a582df510346a0adf

Bundler (3.8s)

Move to the Parcel bundler.

I'm not exactly sure what's making this so much faster. Building with --no-optimizations is the same performance.

Could it be that there are less files to download?

Could it be that the way Parcel handles modules is faster?

Could it be that the JS optimization engine handles bundled JS better?

e082d72212f74a69f742b2344681b6c174016e00

Changing the title is slow (3.7s)

Consistently getting a 2-3% increase in performance by not modifying the title on every render.

I noticed this because there was a linear increase in "Node" memory usage, which dissapeared after changing the title.

824092fba80eed0546fde1f77142580844adc340

Delete duplicate operations (3.4s)

Free performance by removing 3 lines which were doing unecessary, duplicate writes.

09a1a0a2d68ba6bb7f7915fcbb57f482900f1069

Batch non-skippables (3.2s)

First real deviation from the original C++ algorithm, we are batching non-skippable operations because unlike C++ we cannot do a reinterpret_cast to bitshift over our reads.

b6f8ba9b099f1b75d5bd7f54d9e508246e05be42

Branch removal (3.05s)

Do some pre-processing until our iterator i lines up with a value which supports less branching code.

efd5de3597d30e6a6fd7334597b45d60b698600b

Mod operator removal (2.95s)

-   do {
-     ...
-   } while (i % SKIP_MULTIPLYER !== 0 && i < endI);

+   for (let r = 0; r < SKIP_MULTIPLYER && i < endI; r++) {
+     ...
+   }

cbf13e3a6ac330f8174d52bca7f5ddb5a51e6da8

Simplify non-skippable batching (2.85s)

-    for (let r = 0; r < SKIP_MULTIPLYER && i < endI; r++) {
+    const tilI = Math.min(i + SKIP_MULTIPLYER, endI);
+    while (i < tilI) {

575ebd26e88d22902e9c4c509fc8a76b52fcc1b5

Remove now unecessary floor operation (2.75s)

-    while (inSkip[~~(i / SKIP_MULTIPLYER)]) i += SKIP_MULTIPLYER;
+    while (inSkip[i / SKIP_MULTIPLYER]) i += SKIP_MULTIPLYER;

e37da62550e34ac0c425ef78647db95345c73d03

Local reference to Math.min (2.65s)

A simple const min = Math.min outside the scope of the board operator was a great performance improvement.

545195cc7f058f7e577207ac6783e2b3d215ecde

Assignment as an expression ain't bad after all (2.55s)

-      output[i] = isAlive(i, input, width);
-      if (input[i] !== output[i]) revokeSkipForNeighbours(i, outSkip, width);
+      if (input[i] !== (output[i] = isAlive(i, input, width))) revokeSkipForNeighbours(i, outSkip, width);

1f1d08cd5bea5a08e1150e298ad6a912289cba9b


Interesting findings

Sharing data with workers

When sending messages to workers, you have 3 options.

  1. Pass by copy
const data = new ArrayBuffer(1024)
postMessage({data})
  1. Pass by reference & transfer ownership
const data = new ArrayBuffer(1024)
postMessage({data}, [data])

Note: the value of data is mutated by postMessage to revoke access (sendee now holds an empty array)

  1. Pass by reference
const data = new SharedArrayBuffer(1024)
postMessage({data})

Note: to combat race conditions, you can use Atomics

SharedArrayBuffer

In Chrome, the SharedArrayBuffer is only available to a page served with the following headers:

Cross-Origin-Embedder-Policy: require-corp
Cross-Origin-Opener-Policy: same-origin

Chrome extensions stealing performance

During this commit, I came to the realisation that the position of the mouse had a dramatic impact on app performance.

~time mouse state mouse over
6.7s moving chrome window
5.8s still chrome window
5.6s moving devtools window
5.2s moving random window
4.96s still devtools window
4.96s still random window

Disabling the mouse listeners in mouse.ts had no measurable effect on the results

Later on, during this commit, I decided to re-run these tests, and found that there was no longer a gap in "still mouse" performance.

~time mouse state mouse over
3.0s moving chrome window
3.0s moving devtools window
2.7s moving random window
2.5s still chrome window
2.5s still random window
2.5s still devtools window

Being curious, I re-ran the tests on the old commit; and I was still getting crazy results.

So I re-tested the old commit on a Guest account - voila! No more performance issues.

This implied to me that an extension was miss-behaving; so I ran the tests again, but with only a single extension enabled at a time.

~time enabled extension
4.85s -
4.93s adblock
4.95s honey
5.70s bitwarden

Found the culprit. Yikes.

After some digging, I realised that the Bitwarden was listening to document title changes, because after removing them, performance was back to base.

This is why it wasn't an issue anymore, because I deleted title changes back in this commit, and now I understand why that change had such a big impact.

Message via atomics

Communication comparisson between postMessage vs Atomics.notify

https://github.com/Jumbub/game-of-life-js/compare/2ed12fd...9a52f8e

Messaging is 3.8s, atomics are 3.6s

Fastest min

context & fastest solution

// 2.90s
const tilI = (i + SKIP_MULTIPLYER > endI) * endI + (i + SKIP_MULTIPLYER <= endI) * (i + SKIP_MULTIPLYER);

// 2.75s
let tilI = i + SKIP_MULTIPLYER;
if (tilI > endI) tilI = endI;

// 2.72s
const tilI = Math.min(i + SKIP_MULTIPLYER, endI);

// 2.68s
const min = Math.min // outside loop
...
const tilI = min(i + SKIP_MULTIPLYER, endI);

Fastest floor

// 2.95s
const revokeSkipForNeighbours = (i: number, outSkip: Skips, width: number) => {
  const floor = Math.floor;
  outSkip[floor((i - width - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[floor((i - width + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[floor((i - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[floor((i + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[floor((i + width - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[floor((i + width + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
};

// 2.84s
const revokeSkipForNeighbours = (i: number, outSkip: Skips, width: number) => {
  outSkip[Math.floor((i - width - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[Math.floor((i - width + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[Math.floor((i - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[Math.floor((i + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[Math.floor((i + width - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[Math.floor((i + width + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
};

// 2.78s
const floor = Math.floor;
const revokeSkipForNeighbours = (i: number, outSkip: Skips, width: number) => {
  outSkip[floor((i - width - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[floor((i - width + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[floor((i - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[floor((i + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[floor((i + width - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[floor((i + width + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
};

// 2.75
const revokeSkipForNeighbours = (i: number, outSkip: Skips, width: number, floor: typeof Math.floor) => {
  outSkip[floor((i - width - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[floor((i - width + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[floor((i - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[floor((i + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[floor((i + width - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[floor((i + width + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
};

// 2.55s
const revokeSkipForNeighbours = (i: number, outSkip: Skips, width: number) => {
  outSkip[0 | ((i - width - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[0 | ((i - width + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[0 | ((i - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[0 | ((i + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[0 | ((i + width - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[0 | ((i + width + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
};

// 2.55s
const revokeSkipForNeighbours = (i: number, outSkip: Skips, width: number) => {
  outSkip[~~((i - width - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[~~((i - width + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[~~((i - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[~~((i + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[~~((i + width - 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
  outSkip[~~((i + width + 1) / SKIP_MULTIPLYER)] = DONT_SKIP;
};

About

Attempting to match C++/SFML performance with JS/Chrome

Resources

Stars

Watchers

Forks