Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gc: findHead scan optimization #3899

Conversation

HattoriHanzo031
Copy link

@HattoriHanzo031 HattoriHanzo031 commented Sep 8, 2023

Draft PR to continue discussion with @dgryski and @aykevl about optimizing findHead algorithm in case of large object allocation.
The idea is to scan the block state byte at the time. State byte is first XOR-ed with all tails byte which turns all tails in state byte to 0 and then bits.LeadingZeros8 is used to get how many tails are in the byte. Also, if XOR-ed byte is 0 that means it is all tails so we can skip to the next. For the first byte we first shift out the bits that are not part of current object.

@HattoriHanzo031 HattoriHanzo031 force-pushed the gc_find_head_optimization branch 2 times, most recently from e282884 to 565a24d Compare September 8, 2023 18:56
@dgryski
Copy link
Member

dgryski commented Sep 8, 2023

I wonder if we want to leave an old copy of the code disabled unless the runtime is built with runtime_asserts or something, just until we're confident this is always returning the same result. Also, I know you wrote also a 32-bit version -- Is there a way we could have both of these available and decided at compile time?

Also, there is a similar forward scanning loop in findNext(). I wonder if that should also be tweaked.

@aykevl
Copy link
Member

aykevl commented Sep 9, 2023

I need to take a closer look at the code still, but I like how it's kept small.
It's unfortunate that there's a ~100 byte increase in binary size for baremetal chips. Normally that would be a blocker for me but considering the performance benefits I'm not so sure.

Looking at the code, there appears to be a bit of duplication though: for example there are two comparisons against zero. I wonder if the code can be shortened (and thus also reduced in binary size) if the outer if is removed and it's all a single loop? It might also make the code a bit more readable.

I wonder if we want to leave an old copy of the code disabled unless the runtime is built with runtime_asserts or something, just until we're confident this is always returning the same result.

I'd rather not have an old version lying around that isn't getting much testing. Such an old version is likely to bitrot. Instead, I'd prefer we do a ton of testing (corpus, etc), merge it, and if there is a bug we'll know it soon enough (it's easy to verify a potential bug by simply reverting the commit and seeing whether the bug goes away).

Also, I know you wrote also a 32-bit version -- Is there a way we could have both of these available and decided at compile time?

Perhaps this is something for a future PR? Just to keep this PR focused: it's already providing a really nice performance improvement.
(Deciding at compile time is relatively easy: for example if you want to only do this on wasm then you can simply use if GOARCH == "wasm" - this check will be optimized away entirely).

@HattoriHanzo031
Copy link
Author

Also, I know you wrote also a 32-bit version -- Is there a way we could have both of these available and decided at compile time?

32bit version would require much more considerations regarding edge cases and platform differences so I suspect it would grow much more complex than 8bit version, but I agree we can try to implement it later in another PR because of significant performance gain it could provide.

Looking at the code, there appears to be a bit of duplication though: for example there are two comparisons against zero. I wonder if the code can be shortened (and thus also reduced in binary size) if the outer if is removed and it's all a single loop? It might also make the code a bit more readable.

It also looked to me that first if could be avoided and I tried to eliminate it, but the problem is that the first byte we get is a special case because it can also contain bits from different object so it must be handled differently. There still might be a way to do it that I didn't find.

Regarding the code size, I suspect it is due to use of LeadingZeros8 which is implemented as a lookup table. I just built the compiler using self-built LLVM so I can tun the tests and see what can I do to reduce the size. I first have to confirm if that is the reason, because if we are already using this lookup table somewhere it is probably reused (since it is defined in the library) and not add to binary size (and also since lookup table is 256 bytes and increase is around 100). One idea I have is to checking just 6 bits instead of 8 because if we found the byte that is not all tails it must have at least one block that will not be 00 after XOR. This would reduce lookup table to 64bytes.

@HattoriHanzo031
Copy link
Author

@aykevl you were right, it can be simplified, I just needed to subtract number of blocks from other objects at the end :)
Unfortunately, it only saves 4 bytes, but better than nothing, and the code is much more clear.

@aykevl
Copy link
Member

aykevl commented Sep 10, 2023

Regarding the code size, I suspect it is due to use of LeadingZeros8 which is implemented as a lookup table.

It is not. That is the software fallback, but in fact most of the math/bits functions are implemented as special compiler intrinsics. See: #3620

On armv7m LLVM emits a single instruction: https://godbolt.org/z/5aeG8bPfc
On armv6m (microbit for example) LLVM emits a library call, which is probably the main reason for the big increase in binary size: https://godbolt.org/z/YjrqE56Yz. Perhaps it would be a good idea to fall back to the old code on armv6m, AVR, and possibly other architectures that don't have clz?

@HattoriHanzo031
Copy link
Author

It is not. That is the software fallback, but in fact most of the math/bits functions are implemented as special compiler intrinsics. See: #3620

Thanks for the info, it is good to know.

Perhaps it would be a good idea to fall back to the old code on armv6m, AVR, and possibly other architectures that don't have clz?

It probably makes sense, on those targets there will probably not be big amount of heap allocations anyway.

@dgryski
Copy link
Member

dgryski commented Sep 19, 2024

Any outstanding comments here? Is this ready to review/merge? Does it need another pass or implementation attempt? @HattoriHanzo031 If you don't have time, I can probably pick this up and get it across the finish line.

@HattoriHanzo031
Copy link
Author

@dgryski sorry for the delay. I rebased the branch and tried to run some tests and benchmarks on embedded target. I built test program that creates some heap variables in the loop and ran it on rp2040 board. I also added toggling of one pin at the start and the end of the findHead call and capture them on logic analyser so I have exact duration of the call. I will use this over weekend to see the results and maybe tweak the algorithm.

@dgryski
Copy link
Member

dgryski commented Sep 20, 2024

Here are some numbers from https://github.com/dgryski/trifles/tree/master/tinyjsonbench

   4.0  +
        |
        |
        |                   *
        |
   3.8  +                                        *
        |
        |
        |
   3.6  +                   |                    |
        |         +-------------------+          |
        |         |         *         |          |
        |         +-------------------++-------------------+
   3.4  +                   |          |         *         |
        |                              +-------------------+
        |                              +-------------------+
        |                                        |
   3.2  +--------------------------------------------------------------
                          dev.s                opt.s

File   N   Mean  Stddev
dev.s  20  3.51  0.11    (control)
opt.s  20  3.39  0.12    (3.39 < 3.51 ± 0.07, p = .003)

So we can see an improvement of about 10ms or about 3%. ( Data analysis via https://github.com/codahale/tinystat )

@HattoriHanzo031
Copy link
Author

HattoriHanzo031 commented Sep 23, 2024

@dgryski Here are some results from benchmarking. findHead is on average around 2 times faster. It is only slower for very short searches, and the longer the search is, the better the improvement. Here is a comparisons, one channel is the durations of findHead and second channel is duration of runGC
seeed XIAO (SAMD21) board:
dev branch:
xiao_dev
PR branch:
xiao_pr
nice!nano board:
dev branch:
nano_dev
PR branch:
nano_pr

I also ran both algorithms side by side (also comparing the output):
nano_combined

One strange thing I noticed is that on RP-2040 board runGC is slower with new algorithm although findHead invocations took significantly les time:
dev branch:
zero_dev_2
PR branch:
zero_pr_2

Still didn't figure out what is the cause for this since only change is in the findHead function and it runs (on average) much faster and returns the same results. Maybe @aykevl has some explanation.

@dgryski
Copy link
Member

dgryski commented Oct 23, 2024

Given this is slower for very short scans, and that most allocations are only for a single block, I wonder if we want to have an additional test to see if it's a one-block allocation before jumping into the larger logic? Or maybe the extra test outweighs the difference.

@dgryski
Copy link
Member

dgryski commented Oct 23, 2024

Also, ran this code against the corpus with the following patch to check for any mismatches:

diff --git a/src/runtime/gc_blocks.go b/src/runtime/gc_blocks.go
index 43c73143..02a5a5b4 100644
--- a/src/runtime/gc_blocks.go
+++ b/src/runtime/gc_blocks.go
@@ -129,6 +129,12 @@ func (b gcBlock) address() uintptr {
 // points to an allocated object. It returns the same block if this block
 // already points to the head.
 func (b gcBlock) findHead() gcBlock {
+
+       var btmp gcBlock
+       if gcAsserts {
+           btmp = b
+       }
+
        stateBytePtr := (*uint8)(unsafe.Add(metadataStart, b/blocksPerStateByte))

        // XOR the stateByte with byte containing all tails to turn tail bits to 0
@@ -156,6 +162,14 @@ func (b gcBlock) findHead() gcBlock {
        b -= gcBlock(bits.LeadingZeros8(stateByte) / stateBits)

        if gcAsserts {
+               for btmp.state() == blockStateTail {
+                       btmp--
+               }
+
+               if btmp != b {
+                   runtimePanic("gc: findHead optimization mismatch")
+               }
+
                if b.state() != blockStateHead && b.state() != blockStateMark {
                        runtimePanic("gc: found tail without head")
                }

@HattoriHanzo031 HattoriHanzo031 marked this pull request as ready for review October 24, 2024 03:27
@dgryski
Copy link
Member

dgryski commented Oct 24, 2024

This is a short-circuit patch to return early if we're already on a head block.

diff --git a/src/runtime/gc_blocks.go b/src/runtime/gc_blocks.go
index 43c73143..2bc02a02 100644
--- a/src/runtime/gc_blocks.go
+++ b/src/runtime/gc_blocks.go
@@ -129,6 +129,11 @@ func (b gcBlock) address() uintptr {
 // points to an allocated object. It returns the same block if this block
 // already points to the head.
 func (b gcBlock) findHead() gcBlock {
+       // Nothing to do.
+       if b.state() == blockStateHead {
+               return b
+       }
+
        stateBytePtr := (*uint8)(unsafe.Add(metadataStart, b/blocksPerStateByte))

        // XOR the stateByte with byte containing all tails to turn tail bits to 0

@HattoriHanzo031
Copy link
Author

Given this is slower for very short scans, and that most allocations are only for a single block, I wonder if we want to have an additional test to see if it's a one-block allocation before jumping into the larger logic? Or maybe the extra test outweighs the difference.

@dgryski sorry for the late response, I think I tried adding check for single block allocations, but it got complicated because of assert logic. I like how your suggestion avoids dealing with assert since short circuit check does not decrements b so the rest of the code goes trough blocks from the beginning again. I will try and see if this is speeding things up and let you know the results.

Also, ran this code against the corpus with the following patch to check for any mismatches:

Did you found any mismatch when running these tests? I did similar for my test programs and never got different value between old and new algorithm.

@dgryski
Copy link
Member

dgryski commented Oct 27, 2024

I never got any differences between your code and the old version running tinygo through the test corpus on either Linux or WASI. (Pretty sure, although I might double-check this claim...)

{"hifive1b", "examples/echo", 4560, 280, 0, 2268},
{"microbit", "examples/serial", 2868, 388, 8, 2272},
{"wioterminal", "examples/pininterrupt", 6104, 1484, 116, 6832},
{"hifive1b", "examples/echo", 4688, 280, 0, 2268},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current echo size should be 4660. Then this should pass CI and I think we'll be good to merge.

@dgryski
Copy link
Member

dgryski commented Oct 29, 2024

@aykevl Aside from the above CI issue with binary sizes, do you have any comments before this is merged?

@aykevl
Copy link
Member

aykevl commented Oct 31, 2024

@dgryski is there a way I can run tinyjsonbench myself? I'd like to play a bit with findHead optimizations.

aykevl added a commit that referenced this pull request Oct 31, 2024
This is similar to #3899, but
smaller and hopefully just as efficient.
aykevl added a commit that referenced this pull request Oct 31, 2024
This is similar to #3899, but
smaller and hopefully just as efficient.
@aykevl
Copy link
Member

aykevl commented Oct 31, 2024

Did some experimenting, and wrote some code that's a bit smaller (typically 20 byte increase in binary size, as opposed to 44-84 bytes in this PR) and I think just as fast or maybe slightly faster:

        |                                                                 
        |                                                                 
        |                                                                 
        |                                                                 
        |                                                                 
   4.0  +                   |                                             
        |                   |                                             
        |         +-------------------+                                   
        |         +---------*---------+                                   
   3.9  +         |                   |                                   
        |         +-------------------+                                   
        |                   |                                             
        |                                        |                        
   3.8  +                                        |                        
        |                              +-------------------+              
        |                              |         *         |              
        |                              +-------------------+              
   3.7  +----------------------------------------|---------------------   
                        old.times            new.times                    

File       N   Mean  Stddev  
old.times  20  3.92  0.04    (control)
new.times  20  3.77  0.03    (3.77 < 3.92 ± 0.03, p = .000)

PR is at #4574.

The main difference is that I don't try complicated XOR and bits.LeadingZeros8, but instead care about just two cases:

  • If the block state is stored in the last byte, check the whole byte. This is efficient for large allocations where a pointer could point to the middle of it.
  • Else, just use the old code of checking the block state. This is efficient for pointers that point to the start of an allocation (a very common case). It is also reasonably efficient for the first few and last few blocks of an allocation that aren't aligned (b % 4 == 3) because it will loop at most 3 times.

@dgryski as you can see, I did get tinyjsonbench to work. However I did find that run.sh is locale dependent: I had to replace a bunch of commas with dots in old.times and new.times for it to show something sensible (I've set my system to a weird mixture of English language and Dutch formatting of dates and numbers and stuff).

@HattoriHanzo031
Copy link
Author

HattoriHanzo031 commented Nov 1, 2024

PR is at #4574.

The main difference is that I don't try complicated XOR and bits.LeadingZeros8, but instead care about just two cases:

  • If the block state is stored in the last byte, check the whole byte. This is efficient for large allocations where a pointer could point to the middle of it.
  • Else, just use the old code of checking the block state. This is efficient for pointers that point to the start of an allocation (a very common case). It is also reasonably efficient for the first few and last few blocks of an allocation that aren't aligned (b % 4 == 3) because it will loop at most 3 times.

@aykevl I like the simplicity of your suggestion. I did some benchmarking on RP2040 and the code from this PR is slightly (but consistently) faster than your suggestion. I'm not sure if this small speedup justifies so much size increase, especially if benchmarks on linux/wasm show similar/better performance.

@aykevl
Copy link
Member

aykevl commented Nov 1, 2024

@HattoriHanzo031 can you share the benchmark code?

@dgryski
Copy link
Member

dgryski commented Nov 4, 2024

@aykevl I'm definitely a fan of your approach, if only because I could immediately see that it was correct. I'm willing to trade off a little speed for smaller code size and simpler code.

I'd be curious to see the performance difference between the two approaches, rather than comparing against the dev branch.

@HattoriHanzo031
Copy link
Author

HattoriHanzo031 commented Nov 6, 2024

@HattoriHanzo031 can you share the benchmark code?

The benchmark that I am doing on embedded targets does not test the speed of user code, but compares the speed of execution of different versions of findHead implementations. I'm currently not at my personal computer to send you the exact code, but in essence, I wrote findHead function that calls different implementations of findHead and toggles different GPIO before and after the call. Then I'm looking at the timings on the logic analyser and compering the durations of each call. I am also running each findHead implementation in a loop 100 times to minimise the influence of the speed of the code that toggles the pins. findHead have no side effects so I presume this should be realistic test. I also compare the outputs of each implementation to make sure they all produce the same output.

@aykevl I'm definitely a fan of your approach, if only because I could immediately see that it was correct. I'm willing to trade off a little speed for smaller code size and simpler code.

I agree, I think we should abandon LeadingZeroes approach.

I have few ideas for micro optimisations of the approach from @aykevl PR.

func (b gcBlock) findHead() gcBlock {
	for {
		stateByte := *(*uint8)(unsafe.Add(metadataStart, b/blocksPerStateByte))
		if stateByte == blockStateByteAllTails {
			b -= (b % blocksPerStateByte) + 1
			continue
		}

		state := blockState(stateByte>>((b%blocksPerStateByte)*stateBits)) & blockStateMask
		if state != blockStateTail {
			break
		}
		b--
	}

	if gcAsserts {
		if b.state() != blockStateHead && b.state() != blockStateMark {
			runtimePanic("gc: found tail without head")
		}
	}
	return b
}

It is essentially the same code as (b gcBlock) state(), just with added check for AllTails in between.
I didn't test it at all, I will try to run it when I get back to my personal computer.

@HattoriHanzo031
Copy link
Author

                                                                         
   3.2  +                                                                 
        |                                                                 
        |                   *                                             
        |                                                                 
   3.0  +                                        *                        
        |                   *                    *                        
        |                   |                    *                        
   2.8  +                   |                    *                        
        |                   |                    *                        
        |                   |                    *                        
        |         +-------------------+          |                        
   2.6  +         |                   |          |                        
        |         +---------*---------++-------------------+              
        |         |                   |+---------*---------+              
   2.4  +         +-------------------++-------------------+              
        |                   |                    |                        
        |                                                                 
   2.2  +--------------------------------------------------------------   
                        old.times            new.times                    

File       N    Mean  Stddev  
old.times  200  2.54  0.14    (control)
new.times  200  2.49  0.13    (2.49 < 2.54 ± 0.03, p = .000)

I just ran tinyjsonbench for the micro optimisations and it looks like they give comparable performance (new.times), while producing few bites smaller binary.

@HattoriHanzo031
Copy link
Author

HattoriHanzo031 commented Nov 10, 2024

Here are the results from RP2040:
image

Total runtime for each channel for the sample I recorded is:

Channel original total raised for 874.61816ms
Channel xor total raised for 494.92246ms
Channel #4574 total raised for 437.95334ms
Channel #4574_opti total raised for 392.27066ms

Implementation with XOR logic and LeadingZeroes is the faster for all but the smallest allocations, but very slow for smallest allocations which makes it slower overall. Last two are from #4574 and my suggestion for micro optimisations for the code from #4574. They are similar in performance as expected.

Given this results and binary size difference I am in favour of approach from #4574 with micro optimisations.

P.S. this is the code used to generate timing signals on embedded targets:

func timeFn(pin machine.Pin, fn func() gcBlock) gcBlock {
	var acc gcBlock
	pin.High()
	for range 10000 {
		acc += fn()
	}
	pin.Low()
	return acc
}

func (b gcBlock) findHead() gcBlock {
	machine.D14.Configure(machine.PinConfig{Mode: machine.PinOutput})
	machine.D15.Configure(machine.PinConfig{Mode: machine.PinOutput})
	machine.D26.Configure(machine.PinConfig{Mode: machine.PinOutput})
	machine.D27.Configure(machine.PinConfig{Mode: machine.PinOutput})

	old := timeFn(machine.D14, b.findHeadOld)
	xor := timeFn(machine.D15, b.findHeadXor)
	pr4574 := timeFn(machine.D27, b.findHead4574)
	opti := timeFn(machine.D26, b.findHead4574Opti)
	if old != xor || old != opti || old != pr4574 {
		runtimePanic("gc: mismatch", old, xor, opti, pr4574)
	}

	return b.findHeadOld()
}

aykevl added a commit that referenced this pull request Nov 12, 2024
This is similar to #3899, but
smaller and hopefully just as efficient.

Thanks to @HattoriHanzo031 for starting this work, benchmarking, and for
improving the performance of the code even further.
aykevl added a commit that referenced this pull request Nov 12, 2024
This is similar to #3899, but
smaller and hopefully just as efficient.

Thanks to @HattoriHanzo031 for starting this work, benchmarking, and for
improving the performance of the code even further.
@aykevl
Copy link
Member

aykevl commented Nov 12, 2024

@HattoriHanzo031 updated #4574 with your suggestions, and refactored the code a bit to hopefully be easier to read.
Nice benefit: your micro-optimizations even improved binary size by 4 bytes on some chips!

I didn't re-run the benchmarks but the changes seem reasonable and should indeed improve performance.

Implementation with XOR logic and LeadingZeroes is the faster for all but the smallest allocations, but very slow for smallest allocations which makes it slower overall.

Also, I expect most pointers to point to the start of an allocation even for large allocations. For example, a slice will typically be pointing to the start of the buffer unless the code explicitly uses slice = slice[5:] or something.

@HattoriHanzo031
Copy link
Author

@HattoriHanzo031 updated #4574 with your suggestions, and refactored the code a bit to hopefully be easier to read.

Looks very clear and readable. I am closing this PR since it is redundant now.

aykevl added a commit that referenced this pull request Nov 14, 2024
This is similar to #3899, but
smaller and hopefully just as efficient.

Thanks to @HattoriHanzo031 for starting this work, benchmarking, and for
improving the performance of the code even further.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants