
sync: Map has poor performance #28938

Closed · robaho opened this issue Nov 24, 2018 · 44 comments
Labels: FrozenDueToAge, NeedsInvestigation, Performance

Comments

robaho commented Nov 24, 2018

What version of Go are you using (go version)?

1.11.2

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

darwin, osx, core i7

What did you do?

Please see the full test analysis here but in summary, the performance of sync.Map is far worse than expected, even with read-only workloads.

Note, all of the maps are pre-populated, so get will always succeed, and put will always replace an existing element.

BenchmarkMain/unshared.get-8         	20000000	        91.8 ns/op
BenchmarkMain/unshared.put-8         	20000000	       100 ns/op
BenchmarkMain/unshared.putget-8      	10000000	       178 ns/op
BenchmarkMain/unshared.multiget-8    	20000000	        87.0 ns/op
BenchmarkMain/lock.get-8             	20000000	       110 ns/op
BenchmarkMain/lock.put-8             	10000000	       122 ns/op
BenchmarkMain/lock.putget-8          	10000000	       233 ns/op
BenchmarkMain/lock.multiget-8        	10000000	       146 ns/op
BenchmarkMain/lock.multiput-8        	 5000000	       304 ns/op
BenchmarkMain/lock.multiputget-8     	 2000000	       596 ns/op
BenchmarkMain/sync.get-8             	 5000000	       297 ns/op
BenchmarkMain/sync.put-8             	 5000000	       327 ns/op
BenchmarkMain/sync.putget-8          	 2000000	       688 ns/op
BenchmarkMain/sync.multiget-8        	 5000000	       382 ns/op
BenchmarkMain/sync.multiput-8        	 5000000	       395 ns/op
BenchmarkMain/sync.multiputget-8     	 2000000	       740 ns/op

The "multi" above is when 2 Go routines are concurrently accessing the map.

What did you expect to see?

The performance of sync.Map for reads should be nearly equal to that of the standard built-in map. This is the demonstrated behavior of the similar Java ConcurrentHashMap.

What did you see instead?

sync.Map is nearly 3x slower in all cases, including when compared to using an RW mutex to control access to a shared standard map, even with uncontended usage.

Since the sync.Map implementation is lock-free, a 'read operation' should cost at most a single extra volatile read - to load the map reference. This points to a possible problem with atomic.Value?

dsnet (Member) commented Nov 25, 2018

cc @bcmills

dsnet changed the title from "sync.Map has very poor performance" to "sync: Map has poor performance" on Nov 25, 2018
as (Contributor) commented Nov 25, 2018

For anyone looking, the code is actually located here.

I'm not sure how valid these benchmarks are, but removing the remainder operation (%) baked into almost all key-index operations yields more expected results:

goos: linux
goarch: amd64
pkg: github.com/robaho/go-concurrency-test

BenchmarkMain/unshared.get-4         	20000000	       115 ns/op
BenchmarkMain/unshared.put-4         	20000000	       124 ns/op
BenchmarkMain/unshared.putget-4      	10000000	       224 ns/op
BenchmarkMain/unshared.multiget-4    	20000000	       123 ns/op
BenchmarkMain/lock.get-4             	50000000	        37.5 ns/op
BenchmarkMain/lock.put-4             	30000000	        48.2 ns/op
BenchmarkMain/lock.putget-4          	20000000	        83.5 ns/op
BenchmarkMain/lock.multiget-4        	10000000	       137 ns/op
BenchmarkMain/lock.multiput-4        	10000000	       134 ns/op
BenchmarkMain/lock.multiputget-4     	10000000	       220 ns/op
BenchmarkMain/sync.get-4             	20000000	        63.7 ns/op
BenchmarkMain/sync.put-4             	10000000	       130 ns/op
BenchmarkMain/sync.putget-4          	10000000	       207 ns/op
BenchmarkMain/sync.multiget-4        	20000000	        67.0 ns/op
BenchmarkMain/sync.multiput-4        	10000000	       182 ns/op
BenchmarkMain/sync.multiputget-4     	 5000000	       267 ns/op

Benchcmp before/after % removal.

BenchmarkMain/unshared.get-4            116           115           -0.86%
BenchmarkMain/unshared.put-4            118           124           +5.08%
BenchmarkMain/unshared.putget-4         229           224           -2.18%
BenchmarkMain/unshared.multiget-4       123           123           +0.00%
BenchmarkMain/lock.get-4                160           37.5          -76.56%
BenchmarkMain/lock.put-4                188           48.2          -74.36%
BenchmarkMain/lock.putget-4             362           83.5          -76.93%
BenchmarkMain/lock.multiget-4           206           137           -33.50%
BenchmarkMain/lock.multiput-4           367           134           -63.49%
BenchmarkMain/lock.multiputget-4        728           220           -69.78%
BenchmarkMain/sync.get-4                510           63.7          -87.51%
BenchmarkMain/sync.put-4                468           130           -72.22%
BenchmarkMain/sync.putget-4             1132          207           -81.71%
BenchmarkMain/sync.multiget-4           633           67.0          -89.42%
BenchmarkMain/sync.multiput-4           487           182           -62.63%
BenchmarkMain/sync.multiputget-4        1177          267           -77.32%

Additionally, I suggest that anyone investigating examine how these benchmarks are implemented and whether they are even valid: b.N is being updated by multiple goroutines, for instance.

m.Run(names[i]+".multiputget", func(b *testing.B) {
	wg := sync.WaitGroup{}
	for g := 0; g < NGOS; g++ {
		wg.Add(1)
		go func() {
			r := time.Now().Nanosecond()

			var sum int
			for i := 0; i < b.N; i++ {
				r = rand(r)
				sum += impl.Get(r)
				r = rand(r)
				impl.Put(r, r)
			}
			Sink = sum
			wg.Done()
		}()
	}
	wg.Wait()
})
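For comparison, here is a minimal sketch of how the same measurement could be expressed with the testing package's built-in parallelism, so the harness owns the per-goroutine iteration count and the checksum is published without a data race. The identifiers impl, rand, NGOS, and Sink are the repo's; the "-parallel" name and the atomic accumulation are assumed changes, not code from the repo (it would also need sync/atomic imported):

m.Run(names[i]+".multiputget-parallel", func(b *testing.B) {
	var sink int64
	b.SetParallelism(NGOS) // runs NGOS*GOMAXPROCS goroutines
	b.RunParallel(func(pb *testing.PB) {
		r := time.Now().Nanosecond()
		var sum int
		for pb.Next() { // the harness divides b.N across the goroutines
			r = rand(r)
			sum += impl.Get(r)
			r = rand(r)
			impl.Put(r, r)
		}
		atomic.AddInt64(&sink, int64(sum)) // race-free defeat of dead-code elimination
	})
	Sink = int(sink)
})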

Of course, after doing this, I noticed the data race in the benchmarks and ran it with the race detector. So all bets are off:

goos: linux
goarch: amd64
pkg: github.com/robaho/go-concurrency-test
BenchmarkRand-4   	100000000	        10.4 ns/op
populating maps...
BenchmarkMain/unshared.get-4         	 5000000	       328 ns/op
BenchmarkMain/unshared.put-4         	 5000000	       289 ns/op
BenchmarkMain/unshared.putget-4      	 2000000	       631 ns/op
==================
WARNING: DATA RACE
Write at 0x0000007100d8 by goroutine 23:
  github.com/robaho/go-concurrency-test_test.BenchmarkMain.func4.1()
      /g/src/github.com/robaho/go-concurrency-test/maps_test.go:99 +0xf9

Previous write at 0x0000007100d8 by goroutine 22:
  github.com/robaho/go-concurrency-test_test.BenchmarkMain.func4.1()
      /g/src/github.com/robaho/go-concurrency-test/maps_test.go:99 +0xf9

Goroutine 23 (running) created at:
  github.com/robaho/go-concurrency-test_test.BenchmarkMain.func4()
      /g/src/github.com/robaho/go-concurrency-test/maps_test.go:92 +0xeb
  testing.(*B).runN()
      /go/src/testing/benchmark.go:141 +0x129
  testing.(*B).run1.func1()
      /go/src/testing/benchmark.go:214 +0x6b

Goroutine 22 (finished) created at:
  github.com/robaho/go-concurrency-test_test.BenchmarkMain.func4()
      /g/src/github.com/robaho/go-concurrency-test/maps_test.go:92 +0xeb
  testing.(*B).runN()
      /go/src/testing/benchmark.go:141 +0x129
  testing.(*B).run1.func1()
      /go/src/testing/benchmark.go:214 +0x6b
==================
--- FAIL: BenchmarkMain/unshared.multiget
    benchmark.go:147: race detected during execution of benchmark

robaho (Author) commented Nov 25, 2018

I am fairly certain the tests are valid. The b.N is not being updated - it is merely being read by multiple goroutines - and the actual benchmark function does not exit until the goroutines complete, thus the WaitGroup. Still, the "multi" ones are not even the ones referenced by this bug - the non-multi tests use the single goroutine provided by the testing harness.

The reason the mod is there is to limit the number of entries in the map - a cap.
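To make the capping concrete, here is a tiny, hypothetical illustration of the two forms that come up in this thread: the remainder used originally, and the AND-with-a-mask variant (capping at 1024*1024) that appears a few comments below. The names are illustrative, not from the repo:

package main

import "fmt"

const capSize = 1 << 20 // 1024*1024, the cap discussed in this thread

// capMod bounds a pseudo-random key to [0, capSize) with a remainder,
// which compiles to a comparatively expensive division.
func capMod(r int) int { return r % capSize }

// capMask does the same with a single AND; it is only equivalent because
// capSize is a power of two (and r is non-negative here).
func capMask(r int) int { return r & (capSize - 1) }

func main() {
	r := 123456789
	fmt.Println(capMod(r), capMask(r)) // both print the same bounded key
}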

Yes, 'unshared' has a data race when run under multi - but since the map is pre-populated and doesn't change in size during a read, it is a benign data race (that is also why it doesn't run the multi-put tests - otherwise an exception is thrown).

And @as your "changed tests" are not valid, as there is no way a test WITH locks (lock) should outperform the same test WITHOUT locks (unshared) - especially being 4x faster! ... so something is wrong... maybe if you linked to the code I could point it out.

To make the point stronger - it is the performance difference between non-multi "sync" as compared to "lock" and "unshared".

as (Contributor) commented Nov 25, 2018

The "unshared" benchmarks still contain the remainder operation. I should have pointed that out -- I didn't remove it for all of your types.

This was originally to illustrate how the use of high-cycle-cost division instructions dominates the CPU time in the benchmark itself.

There is no such thing as a benign data race. It is counterproductive to further discuss a racy benchmark (which is why I did not publish any code changes).

robaho (Author) commented Nov 25, 2018

@as I'm sorry but you are not correct on a couple of counts:

  1. If you read the complete read-me, it states that "unshared" is not designed or valid for concurrent use - it works because the map is pre-populated; in fact, the built-in map code detects concurrent writes. The supporting evidence for this bug is not racy - you can remove those tests and the claim still holds.

  2. Your "high cost of division" claim is nonsense. It is a single mod operation. The built-in map itself uses many... In fact, I changed the mod to a single AND with a mask, capping at 1024*1024. I posted the updated code to the repo. Here are the results:

BenchmarkMain/unshared.get-8         	20000000	        91.8 ns/op
BenchmarkMain/unshared.put-8         	20000000	        95.7 ns/op
BenchmarkMain/unshared.putget-8      	10000000	       174 ns/op
BenchmarkMain/unshared.multiget-8    	20000000	        89.9 ns/op
BenchmarkMain/lock.get-8             	20000000	       112 ns/op
BenchmarkMain/lock.put-8             	10000000	       123 ns/op
BenchmarkMain/lock.putget-8          	10000000	       231 ns/op
BenchmarkMain/lock.multiget-8        	10000000	       144 ns/op
BenchmarkMain/lock.multiput-8        	 5000000	       305 ns/op
BenchmarkMain/lock.multiputget-8     	 3000000	       495 ns/op
BenchmarkMain/sync.get-8             	 5000000	       287 ns/op
BenchmarkMain/sync.put-8             	 5000000	       340 ns/op
BenchmarkMain/sync.putget-8          	 2000000	       748 ns/op
BenchmarkMain/sync.multiget-8        	 3000000	       407 ns/op
BenchmarkMain/sync.multiput-8        	 5000000	       354 ns/op
BenchmarkMain/sync.multiputget-8     	 2000000	       773 ns/op

robaho (Author) commented Nov 25, 2018

@as your code changes are probably incorrect, since if you remove the mod/and, there is a very good chance that the random read will go for a value not in the map, which would be very fast... that is why you are seeing the "improved results". I think you should be less hostile in the future.

edit: this is especially true since the Go map uses a 'top hash' as a quick check - many of the reads would terminate quickly

robaho (Author) commented Nov 25, 2018

@as lastly, your "racy" assessment is incorrect as well - if you examine the code, the race is on the "sink" used to avoid dead-code elimination (not a race in using the map) - which, by the way, most of the internal Go benchmark tests don't even use, and they are invalid because of it... only some have been fixed to use a package-level sink.

robaho (Author) commented Nov 25, 2018

@as I would kindly ask that you delete your comments, as they are invalid, and only bring noise to this issue.

ALTree (Member) commented Nov 25, 2018

@robaho A program with a data race in it is not a valid Go program.

Even if the data races don't impact your benchmark results (which may be possible), I suggest you fix them, just to make your benchmarks valid Go code.

ALTree added the NeedsInvestigation and Performance labels on Nov 25, 2018
robaho (Author) commented Nov 25, 2018

@ALTree The data race is only on the setting of the sink in the benchmark code - it is debatable whether the race detector should even report this as a data race, since the sink is only there to prevent dead-code elimination. Still, I went ahead and changed it to an atomic.Value, so I assume it will be OK now.
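For reference, a minimal sketch of the kind of change described here (identifiers are illustrative, not the repo's actual ones): publishing the checksum through an atomic value keeps the dead-code-elimination guard without the reported race.

package main

import "sync/atomic"

// Sink is written by every benchmark goroutine purely so the compiler
// cannot elide the loop that computes sum.
var Sink atomic.Value

func publish(sum int) {
	Sink.Store(sum) // concurrent Stores of the same type are safe; a plain `Sink = sum` is not
}

func main() { publish(42) }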

Also, I moved the "masking" into the test harness and out of the core code to make it even simpler.

robaho (Author) commented Nov 25, 2018

After all of the changes, the results still show an issue:

BenchmarkMain/unshared.get-8         	20000000	        85.5 ns/op
BenchmarkMain/unshared.put-8         	20000000	        92.3 ns/op
BenchmarkMain/unshared.putget-8      	10000000	       177 ns/op
BenchmarkMain/unshared.multiget-8    	20000000	        88.2 ns/op
BenchmarkMain/lock.get-8             	10000000	       109 ns/op
BenchmarkMain/lock.put-8             	10000000	       125 ns/op
BenchmarkMain/lock.putget-8          	 5000000	       233 ns/op
BenchmarkMain/lock.multiget-8        	10000000	       158 ns/op
BenchmarkMain/lock.multiput-8        	 5000000	       293 ns/op
BenchmarkMain/lock.multiputget-8     	 3000000	       571 ns/op
BenchmarkMain/sync.get-8             	 5000000	       299 ns/op
BenchmarkMain/sync.put-8             	 5000000	       344 ns/op
BenchmarkMain/sync.putget-8          	 2000000	       678 ns/op
BenchmarkMain/sync.multiget-8        	 5000000	       375 ns/op
BenchmarkMain/sync.multiput-8        	 5000000	       337 ns/op
BenchmarkMain/sync.multiputget-8     	 2000000	       707 ns/op

(One comment from ALTree and two from robaho have been minimized.)

as (Contributor) commented Nov 25, 2018

You are not disagreeing with me, you are disagreeing with the race detector. The justification for having the race and other provided assumptions are of little interest to me.

After rewriting the benchmark in a way I could understand, I obtained the results below. My variables of interest are:

  • The size of the map: fixed to 2^20.
  • The number of writers to the cache: 0 - 2.
  • The access pattern: sequential (Incr) vs. random (Rand, xorshift).
  • The access-pattern variance (if applicable): a simple bitmask applied to the xorshift output.

With just one concurrent writer, the sync map already beats the lock.

goos: linux
goarch: amd64
pkg: github.com/as/goissues/28938
BenchmarkMap/0Writer/Sync/Incr/Mask0-4         	50000000	        31.0 ns/op
BenchmarkMap/0Writer/Sync/Incr/Maskf-4         	50000000	        31.3 ns/op
BenchmarkMap/0Writer/Sync/Incr/Maskff-4        	30000000	        35.7 ns/op
BenchmarkMap/0Writer/Sync/Incr/Maskfff-4       	30000000	        51.1 ns/op
BenchmarkMap/0Writer/Sync/Incr/Maskffff-4      	10000000	       203 ns/op
BenchmarkMap/0Writer/Sync/Incr/Maskfffff-4     	10000000	       194 ns/op
BenchmarkMap/0Writer/Sync/Rand/Mask0-4         	50000000	        31.4 ns/op
BenchmarkMap/0Writer/Sync/Rand/Maskf-4         	50000000	        33.4 ns/op
BenchmarkMap/0Writer/Sync/Rand/Maskff-4        	30000000	        36.4 ns/op
BenchmarkMap/0Writer/Sync/Rand/Maskfff-4       	20000000	        80.6 ns/op
BenchmarkMap/0Writer/Sync/Rand/Maskffff-4      	 5000000	       343 ns/op
BenchmarkMap/0Writer/Sync/Rand/Maskfffff-4     	 3000000	       509 ns/op
BenchmarkMap/0Writer/Lock/Incr/Mask0-4         	50000000	        24.5 ns/op
BenchmarkMap/0Writer/Lock/Incr/Maskf-4         	50000000	        24.0 ns/op
BenchmarkMap/0Writer/Lock/Incr/Maskff-4        	50000000	        27.0 ns/op
BenchmarkMap/0Writer/Lock/Incr/Maskfff-4       	50000000	        31.1 ns/op
BenchmarkMap/0Writer/Lock/Incr/Maskffff-4      	20000000	       124 ns/op
BenchmarkMap/0Writer/Lock/Incr/Maskfffff-4     	10000000	       177 ns/op
BenchmarkMap/0Writer/Lock/Rand/Mask0-4         	50000000	        23.8 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskf-4         	50000000	        23.8 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskff-4        	50000000	        25.1 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskfff-4       	50000000	        30.3 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskffff-4      	20000000	       121 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskfffff-4     	10000000	       183 ns/op
BenchmarkMap/1Writer/Lock/Incr/Mask0-4         	 3000000	       522 ns/op
BenchmarkMap/1Writer/Lock/Incr/Maskf-4         	 3000000	       526 ns/op
BenchmarkMap/1Writer/Lock/Incr/Maskff-4        	 3000000	       539 ns/op
BenchmarkMap/1Writer/Lock/Incr/Maskfff-4       	 3000000	       593 ns/op
BenchmarkMap/1Writer/Lock/Incr/Maskffff-4      	 2000000	       718 ns/op
BenchmarkMap/1Writer/Lock/Incr/Maskfffff-4     	 2000000	       738 ns/op
BenchmarkMap/1Writer/Lock/Rand/Mask0-4         	 3000000	       525 ns/op
BenchmarkMap/1Writer/Lock/Rand/Maskf-4         	 3000000	       519 ns/op
BenchmarkMap/1Writer/Lock/Rand/Maskff-4        	 3000000	       532 ns/op
BenchmarkMap/1Writer/Lock/Rand/Maskfff-4       	 3000000	       541 ns/op
BenchmarkMap/1Writer/Lock/Rand/Maskffff-4      	 2000000	       639 ns/op
BenchmarkMap/1Writer/Lock/Rand/Maskfffff-4     	 2000000	       742 ns/op
BenchmarkMap/1Writer/Sync/Incr/Mask0-4         	20000000	        52.7 ns/op
BenchmarkMap/1Writer/Sync/Incr/Maskf-4         	30000000	        47.2 ns/op
BenchmarkMap/1Writer/Sync/Incr/Maskff-4        	30000000	        54.8 ns/op
BenchmarkMap/1Writer/Sync/Incr/Maskfff-4       	20000000	        66.7 ns/op
BenchmarkMap/1Writer/Sync/Incr/Maskffff-4      	10000000	       154 ns/op
BenchmarkMap/1Writer/Sync/Incr/Maskfffff-4     	10000000	       185 ns/op
BenchmarkMap/1Writer/Sync/Rand/Mask0-4         	20000000	        53.6 ns/op
BenchmarkMap/1Writer/Sync/Rand/Maskf-4         	30000000	        46.6 ns/op
BenchmarkMap/1Writer/Sync/Rand/Maskff-4        	30000000	        50.0 ns/op
BenchmarkMap/1Writer/Sync/Rand/Maskfff-4       	20000000	        75.5 ns/op
BenchmarkMap/1Writer/Sync/Rand/Maskffff-4      	10000000	       279 ns/op
BenchmarkMap/1Writer/Sync/Rand/Maskfffff-4     	 5000000	       398 ns/op
BenchmarkMap/2Writer/Sync/Incr/Mask0-4         	20000000	        70.3 ns/op
BenchmarkMap/2Writer/Sync/Incr/Maskf-4         	30000000	        58.3 ns/op
BenchmarkMap/2Writer/Sync/Incr/Maskff-4        	20000000	        60.4 ns/op
BenchmarkMap/2Writer/Sync/Incr/Maskfff-4       	20000000	        94.8 ns/op
BenchmarkMap/2Writer/Sync/Incr/Maskffff-4      	10000000	       185 ns/op
BenchmarkMap/2Writer/Sync/Incr/Maskfffff-4     	10000000	       203 ns/op
BenchmarkMap/2Writer/Sync/Rand/Mask0-4         	20000000	        72.7 ns/op
BenchmarkMap/2Writer/Sync/Rand/Maskf-4         	30000000	        66.7 ns/op
BenchmarkMap/2Writer/Sync/Rand/Maskff-4        	30000000	        69.2 ns/op
BenchmarkMap/2Writer/Sync/Rand/Maskfff-4       	20000000	       104 ns/op
BenchmarkMap/2Writer/Sync/Rand/Maskffff-4      	 5000000	       308 ns/op
BenchmarkMap/2Writer/Sync/Rand/Maskfffff-4     	 5000000	       403 ns/op
BenchmarkMap/2Writer/Lock/Incr/Mask0-4         	 2000000	       766 ns/op
BenchmarkMap/2Writer/Lock/Incr/Maskf-4         	 2000000	       757 ns/op
BenchmarkMap/2Writer/Lock/Incr/Maskff-4        	 2000000	       785 ns/op
BenchmarkMap/2Writer/Lock/Incr/Maskfff-4       	 2000000	       799 ns/op
BenchmarkMap/2Writer/Lock/Incr/Maskffff-4      	 2000000	       883 ns/op
BenchmarkMap/2Writer/Lock/Incr/Maskfffff-4     	 2000000	       912 ns/op
BenchmarkMap/2Writer/Lock/Rand/Mask0-4         	 2000000	       768 ns/op
BenchmarkMap/2Writer/Lock/Rand/Maskf-4         	 2000000	       778 ns/op
BenchmarkMap/2Writer/Lock/Rand/Maskff-4        	 2000000	       782 ns/op
BenchmarkMap/2Writer/Lock/Rand/Maskfff-4       	 2000000	       799 ns/op
BenchmarkMap/2Writer/Lock/Rand/Maskffff-4      	 2000000	       865 ns/op
BenchmarkMap/2Writer/Lock/Rand/Maskfffff-4     	 2000000	       941 ns/op
PASS
ok  	github.com/as/goissues/28938	213.446s

Why are these results so different from the other benchmark?

https://github.com/as/goissues/tree/master/28938

robaho (Author) commented Nov 25, 2018

@as They are not that different. Your results agree with mine. Look at these two lines in your results.

BenchmarkMap/0Writer/Sync/Rand/Maskfffff-4     	 3000000	       509 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskfffff-4     	10000000	       183 ns/op

The lock map is 2x faster. This should never happen. In a "read only" test, sync.Map should behave like a standard map, so that should be the lower bound, with "lock" being higher (slower) since it carries the locking overhead.

robaho (Author) commented Nov 25, 2018

@as Actually, it starts even earlier:

BenchmarkMap/0Writer/Sync/Rand/Maskfff-4       	20000000	        80.6 ns/op
BenchmarkMap/0Writer/Sync/Rand/Maskffff-4      	 5000000	       343 ns/op
BenchmarkMap/0Writer/Sync/Rand/Maskfffff-4     	 3000000	       509 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskfff-4       	50000000	        30.3 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskffff-4      	20000000	       121 ns/op
BenchmarkMap/0Writer/Lock/Rand/Maskfffff-4     	10000000	       183 ns/op

In all of these cases, the lock is faster than the sync.

robaho (Author) commented Nov 25, 2018

@as your results don't seem consistent even within a category, as in

BenchmarkMap/0Writer/Sync/Rand/Maskfffff-4     	 3000000	       509 ns/op
BenchmarkMap/1Writer/Sync/Rand/Maskfffff-4     	 5000000	       398 ns/op
BenchmarkMap/2Writer/Sync/Rand/Maskfffff-4     	 5000000	       403 ns/op

I don't think that adding writers should make the reader perform faster... I think this is only possible if the writers are putting more elements into the L1/L2/L3 cache... unless something in the read-only workload is causing mass cache evictions - maybe an interaction between atomic.Value and the GC? Not sure, but there is definitely a problem here.

An easy way to test this is to use multiple readers only... my "multi-get" tests exactly that, and it still shows the "incorrect" results.

robaho (Author) commented Nov 25, 2018

@as as an aside, I'm sorry that you needed to rewrite my tests - the latest version I posted is far easier to follow and maintain - simpler I think than even yours, but to each his own. I'm happy you were able to duplicate my findings.

ianlancetaylor (Member) commented:

I'm not sure I understand the benchmarks. It's not clear to me that they are testing the case that sync.Map is intended for: multiple goroutines concurrently reading from the map, while permitting an occasional goroutine to write to the map. It is not particularly surprising that sync.Map is not as fast as the builtin map type when using a single goroutine; in that case there is no contention for the lock. sync.Map is a specialized type, as described in the doc comment; it's not a general replacement for a lock around a map. See the benchmarks in sync/map_bench_test.go; see how those benchmarks perform with your alternative implementations.

robaho (Author) commented Nov 25, 2018

They are, IMO. The 'sync' map should not perform worse than 'lock' in a primarily-read workload. (In this case it is read-only, but you could sprinkle a few writes in there if you want.)

The implementation of sync.Map is an atomic read of a value (the standard map), followed by an access of that standard map. So the only performance difference between sync.Map and the std map in a read-only scenario should be the cost of the atomic read, and in a read-only environment that cost is negligible, since the value will already be in the cache.
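A minimal sketch of the read path being described here - this is the pattern the comment attributes to sync.Map, not the actual standard-library code: readers do one atomic load of a plain map followed by an ordinary lookup, while writers replace the whole map under a mutex (copy-on-write).

package main

import (
	"sync"
	"sync/atomic"
)

// rcuMap illustrates the "one atomic read, then a normal map access" model.
type rcuMap struct {
	mu sync.Mutex   // serializes writers only
	v  atomic.Value // holds a map[int]int snapshot
}

func newRCUMap() *rcuMap {
	m := &rcuMap{}
	m.v.Store(map[int]int{})
	return m
}

func (m *rcuMap) Get(k int) (int, bool) {
	snap := m.v.Load().(map[int]int) // the single extra atomic read
	val, ok := snap[k]
	return val, ok
}

func (m *rcuMap) Put(k, val int) {
	m.mu.Lock()
	defer m.mu.Unlock()
	old := m.v.Load().(map[int]int)
	next := make(map[int]int, len(old)+1) // copy-on-write: writes are expensive,
	for kk, vv := range old {             // but reads stay at near plain-map speed
		next[kk] = vv
	}
	next[k] = val
	m.v.Store(next)
}

func main() {
	m := newRCUMap()
	m.Put(1, 42)
	_, _ = m.Get(1)
}

This is also essentially the "atomic.Value of a standard map" variant mentioned later in the thread, whose writes were reported to be very slow.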

There is definitely something wrong. It is why I included the "equivalent" Java tests - in a read-only (or rarely written) workload, sync.Map should perform the same as an unsynchronized std map - that is the point of lock-free data structures.

robaho (Author) commented Nov 25, 2018

@ianlancetaylor the tests by @as exercise "read only" access in the 0Writer cases, which demonstrate the problem. His tests also show that the reader gets faster as writers are added - which should not be the case according to the documentation and intended usage of sync.Map either...

ianlancetaylor (Member) commented:

The point of sync.Map is to avoid lock contention. You need to test multiple concurrent readers. Otherwise you aren't going to see any lock contention.

robaho (Author) commented Nov 25, 2018

@ianlancetaylor that is what my "multi-get" tests do, and they show the same problem - but even so, a single reader and no writer should not perform worse than a lock-based solution - it is a single atomic read. If it does, there is something wrong with atomic.Value.

robaho (Author) commented Nov 25, 2018

I guess the powers that be can just ignore this, and claim this is the way it is supposed to work, but I've been working with lock-free structures for quite a while, and this is not a traditional performance profile.

as (Contributor) commented Nov 25, 2018

Modified the test so that:

  • Each writer executes a Put with 1/8 probability (i.e., an occasional write).
  • For each writer, there are an additional 10 concurrent readers.
  • The main benchmarking routine executes a Get, as usual.

The behavior is expected and conforms to the documentation.

Results
; cd /g/src/github.com/as/goissues/28938
; go test -bench .
goos: linux
goarch: amd64
pkg: github.com/as/goissues/28938
BenchmarkMap/0W1R/Sync/Incr/Mask0-4         	50000000	        31.6 ns/op
BenchmarkMap/0W1R/Sync/Incr/Maskf-4         	50000000	        32.7 ns/op
BenchmarkMap/0W1R/Sync/Incr/Maskff-4        	30000000	        36.0 ns/op
BenchmarkMap/0W1R/Sync/Incr/Maskfff-4       	30000000	        51.5 ns/op
BenchmarkMap/0W1R/Sync/Incr/Maskffff-4      	10000000	       190 ns/op
BenchmarkMap/0W1R/Sync/Incr/Maskfffff-4     	10000000	       215 ns/op
BenchmarkMap/0W1R/Sync/Rand/Mask0-4         	50000000	        32.4 ns/op
BenchmarkMap/0W1R/Sync/Rand/Maskf-4         	50000000	        32.9 ns/op
BenchmarkMap/0W1R/Sync/Rand/Maskff-4        	30000000	        35.3 ns/op
BenchmarkMap/0W1R/Sync/Rand/Maskfff-4       	20000000	        57.0 ns/op
BenchmarkMap/0W1R/Sync/Rand/Maskffff-4      	 5000000	       320 ns/op
BenchmarkMap/0W1R/Sync/Rand/Maskfffff-4     	 3000000	       500 ns/op
BenchmarkMap/0W1R/Lock/Incr/Mask0-4         	50000000	        23.5 ns/op
BenchmarkMap/0W1R/Lock/Incr/Maskf-4         	50000000	        23.4 ns/op
BenchmarkMap/0W1R/Lock/Incr/Maskff-4        	50000000	        23.6 ns/op
BenchmarkMap/0W1R/Lock/Incr/Maskfff-4       	50000000	        31.3 ns/op
BenchmarkMap/0W1R/Lock/Incr/Maskffff-4      	20000000	       121 ns/op
BenchmarkMap/0W1R/Lock/Incr/Maskfffff-4     	10000000	       171 ns/op
BenchmarkMap/0W1R/Lock/Rand/Mask0-4         	50000000	        23.7 ns/op
BenchmarkMap/0W1R/Lock/Rand/Maskf-4         	50000000	        24.0 ns/op
BenchmarkMap/0W1R/Lock/Rand/Maskff-4        	50000000	        23.8 ns/op
BenchmarkMap/0W1R/Lock/Rand/Maskfff-4       	50000000	        29.8 ns/op
BenchmarkMap/0W1R/Lock/Rand/Maskffff-4      	20000000	        99.6 ns/op
BenchmarkMap/0W1R/Lock/Rand/Maskfffff-4     	10000000	       185 ns/op
BenchmarkMap/1W11R/Lock/Incr/Mask0-4        	 1000000	      1412 ns/op
BenchmarkMap/1W11R/Lock/Incr/Maskf-4        	 1000000	      1427 ns/op
BenchmarkMap/1W11R/Lock/Incr/Maskff-4       	 1000000	      1464 ns/op
BenchmarkMap/1W11R/Lock/Incr/Maskfff-4      	 1000000	      1619 ns/op
BenchmarkMap/1W11R/Lock/Incr/Maskffff-4     	 1000000	      2184 ns/op
BenchmarkMap/1W11R/Lock/Incr/Maskfffff-4    	  500000	      2229 ns/op
BenchmarkMap/1W11R/Lock/Rand/Mask0-4        	 1000000	      1421 ns/op
BenchmarkMap/1W11R/Lock/Rand/Maskf-4        	 1000000	      1485 ns/op
BenchmarkMap/1W11R/Lock/Rand/Maskff-4       	 1000000	      1380 ns/op
BenchmarkMap/1W11R/Lock/Rand/Maskfff-4      	 1000000	      1428 ns/op
BenchmarkMap/1W11R/Lock/Rand/Maskffff-4     	 1000000	      2151 ns/op
BenchmarkMap/1W11R/Lock/Rand/Maskfffff-4    	  500000	      2564 ns/op
BenchmarkMap/1W11R/Sync/Incr/Mask0-4        	10000000	       219 ns/op
BenchmarkMap/1W11R/Sync/Incr/Maskf-4        	10000000	       224 ns/op
BenchmarkMap/1W11R/Sync/Incr/Maskff-4       	10000000	       242 ns/op
BenchmarkMap/1W11R/Sync/Incr/Maskfff-4      	 5000000	       352 ns/op
BenchmarkMap/1W11R/Sync/Incr/Maskffff-4     	 3000000	       653 ns/op
BenchmarkMap/1W11R/Sync/Incr/Maskfffff-4    	 2000000	       658 ns/op
BenchmarkMap/1W11R/Sync/Rand/Mask0-4        	10000000	       213 ns/op
BenchmarkMap/1W11R/Sync/Rand/Maskf-4        	10000000	       226 ns/op
BenchmarkMap/1W11R/Sync/Rand/Maskff-4       	 5000000	       304 ns/op
BenchmarkMap/1W11R/Sync/Rand/Maskfff-4      	 2000000	       596 ns/op
BenchmarkMap/1W11R/Sync/Rand/Maskffff-4     	 1000000	      1095 ns/op
BenchmarkMap/1W11R/Sync/Rand/Maskfffff-4    	 1000000	      1559 ns/op
BenchmarkMap/2W21R/Lock/Incr/Mask0-4        	  500000	      2734 ns/op
BenchmarkMap/2W21R/Lock/Incr/Maskf-4        	  500000	      2621 ns/op
BenchmarkMap/2W21R/Lock/Incr/Maskff-4       	  500000	      2972 ns/op
BenchmarkMap/2W21R/Lock/Incr/Maskfff-4      	  500000	      3232 ns/op
BenchmarkMap/2W21R/Lock/Incr/Maskffff-4     	  500000	      3334 ns/op
BenchmarkMap/2W21R/Lock/Incr/Maskfffff-4    	  300000	      3745 ns/op
BenchmarkMap/2W21R/Lock/Rand/Mask0-4        	 3000000	      2564 ns/op
BenchmarkMap/2W21R/Lock/Rand/Maskf-4        	  500000	      2499 ns/op
BenchmarkMap/2W21R/Lock/Rand/Maskff-4       	  500000	      2637 ns/op
BenchmarkMap/2W21R/Lock/Rand/Maskfff-4      	  500000	      3160 ns/op
BenchmarkMap/2W21R/Lock/Rand/Maskffff-4     	  500000	      3586 ns/op
BenchmarkMap/2W21R/Lock/Rand/Maskfffff-4    	 1000000	      3150 ns/op
BenchmarkMap/2W21R/Sync/Incr/Mask0-4        	 5000000	       419 ns/op
BenchmarkMap/2W21R/Sync/Incr/Maskf-4        	 5000000	       431 ns/op
BenchmarkMap/2W21R/Sync/Incr/Maskff-4       	 5000000	       430 ns/op
BenchmarkMap/2W21R/Sync/Incr/Maskfff-4      	 3000000	       632 ns/op
BenchmarkMap/2W21R/Sync/Incr/Maskffff-4     	 1000000	      1206 ns/op
BenchmarkMap/2W21R/Sync/Incr/Maskfffff-4    	 1000000	      1400 ns/op
BenchmarkMap/2W21R/Sync/Rand/Mask0-4        	 3000000	       476 ns/op
BenchmarkMap/2W21R/Sync/Rand/Maskf-4        	 3000000	       450 ns/op
BenchmarkMap/2W21R/Sync/Rand/Maskff-4       	 5000000	       488 ns/op
BenchmarkMap/2W21R/Sync/Rand/Maskfff-4      	 3000000	       625 ns/op
BenchmarkMap/2W21R/Sync/Rand/Maskffff-4     	 1000000	      1807 ns/op
BenchmarkMap/2W21R/Sync/Rand/Maskfffff-4    	 1000000	      2809 ns/op

However, there is an interesting (but likely irrelevant) observation:

A large access window (i.e., the entire 2^20 keyspace) results in a 2x slower sync.Map when that access is random (and there are no writers, and only one reader). However, when the access is sequential, we see performance on par with the lock.

BenchmarkMap/0W1R/Lock/Incr/Maskfffff-4     	10000000	       171 ns/op
BenchmarkMap/0W1R/Lock/Rand/Maskfffff-4     	10000000	       185 ns/op
BenchmarkMap/0W1R/Sync/Incr/Maskfffff-4     	10000000	       215 ns/op
BenchmarkMap/0W1R/Sync/Rand/Maskfffff-4     	 3000000	       500 ns/op

robaho (Author) commented Nov 25, 2018

@as that is exactly the problem... but also, it is not on par for incr, it is 30% slower... but that doesn't matter, as a map is NOT designed to be read incrementally - that is an invalid use case - it is designed for random access, so being 3x slower than when using locks means a broken implementation somewhere... 30% is a problem for a common structure, 300% is a severe issue.

What no one seems to accept is that even 1/8 writes is far too many - a common data structure in most HFT and HPC systems is a lock-free "read-only" concurrent cache (i.e. very, very few writes). It should be on par with an unsynchronized map when doing reads (which is why I also included the Java ConcurrentHashMap examples), but still allow some updates.

I think your tests show that it doesn't match the documentation, to quote:

// The Map type is optimized for two common use cases: (1) when the entry for a given
// key is only ever written once but read many times, as in caches that only grow,
// or ...

so clearly this implementation is not optimized for goal 1, and your tests with 1/8 writes are not testing goal 1 either.

My tests are designed to test goal 1, that is why the map is pre-populated, and the 'get' tests are most important.

I really can't see how multiple people are arguing against this - it is almost like you are afraid to admit there is an underlying problem, and it is easier to just try and argue that "well, it works according to the design", but even that is an invalid argument.

as (Contributor) commented Nov 25, 2018

I don't think you are benchmarking your shareshard.

robaho/go-concurrency-test@aed8afc

	impls := []go_concurrency.Cache{um, lm, sm, cm, sc, im, im2}
	names := []string{"unshared", "lock", "sync", "channel", "shard", "shareshard", "intmap", "intmap2"}

robaho (Author) commented Nov 25, 2018

@as correct, bummer... knew it was too good to be true... still trying to figure out where the slowness is coming from... I wrote a different "shared shard" originally, with my own atomic.Value of a standard map, and it worked well for reads, but the writes were incredibly slow

** I deleted the comment with timings as it was noise...

robaho (Author) commented Nov 26, 2018

OK, so I think I figured out why the performance of sync.Map is poor in this case. Because Map uses the standard map implementation, which isn't designed for CAS/concurrency, it has no choice but to add an extra layer of indirection in order to perform the CAS on an individual key/value pair. As the map working set increases, this level of indirection adds hundreds of nanoseconds to each lookup: the entry must be loaded (a map lookup), then the pointer loaded atomically from the entry, then that pointer dereferenced into the value - whereas the standard map does only the first map retrieval. Hence the 3x slowness.
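A simplified, illustrative sketch of the indirection chain being described (the type and field names are assumptions loosely modeled on the sync.Map internals, not the real code): a read must atomically load a snapshot, look up a *entry, atomically load the entry's pointer, and finally dereference it.

package main

import (
	"sync/atomic"
	"unsafe"
)

// entry holds the value indirectly so it can be swapped with CAS.
type entry struct {
	p unsafe.Pointer // effectively *interface{}
}

// snapshot is the read-only view loaded atomically on every access.
type snapshot struct {
	clean map[interface{}]*entry
}

// load walks the full chain: atomic load of the snapshot, a map lookup to
// *entry, an atomic load of the entry's pointer, and a final dereference -
// several pointer hops more than a plain map[int]int lookup.
func load(read *atomic.Value, key interface{}) (interface{}, bool) {
	snap, _ := read.Load().(snapshot)
	e, ok := snap.clean[key]
	if !ok {
		return nil, false
	}
	p := atomic.LoadPointer(&e.p)
	if p == nil {
		return nil, false
	}
	return *(*interface{})(p), true
}

func main() {
	var read atomic.Value
	v := interface{}(42)
	read.Store(snapshot{clean: map[interface{}]*entry{
		1: {p: unsafe.Pointer(&v)},
	}})
	_, _ = load(&read, 1)
}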

This is why the newly created "shared sharded" map, which is just multiple map[int]int shards, performs the same as the lock map for concurrent reads and outperforms it for concurrent writes (the shard lookup will always be in the cache, followed by a single map lookup).
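A minimal sketch of the sharded design being described, under the assumptions that keys are ints and the shard count is a power of two (the repo's actual code may differ):

package main

import "sync"

const nShards = 16 // power of two so shard selection is a single AND

type shard struct {
	sync.RWMutex
	m map[int]int
}

// shardedMap spreads keys across independently locked maps, so concurrent
// readers and writers rarely contend on the same lock.
type shardedMap struct {
	shards [nShards]shard
}

func newShardedMap() *shardedMap {
	s := &shardedMap{}
	for i := range s.shards {
		s.shards[i].m = make(map[int]int)
	}
	return s
}

func (s *shardedMap) Get(k int) (int, bool) {
	sh := &s.shards[k&(nShards-1)] // the shard headers stay hot in cache
	sh.RLock()
	v, ok := sh.m[k]
	sh.RUnlock()
	return v, ok
}

func (s *shardedMap) Put(k, v int) {
	sh := &s.shards[k&(nShards-1)]
	sh.Lock()
	sh.m[k] = v
	sh.Unlock()
}

func main() {
	s := newShardedMap()
	s.Put(1, 42)
	_, _ = s.Get(1)
}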

Here are the performance numbers:

With 2 concurrent Go routines (multi)
BenchmarkMain/shareshard.get-8       	20000000	       112 ns/op
BenchmarkMain/shareshard.put-8       	10000000	       127 ns/op
BenchmarkMain/shareshard.putget-8    	10000000	       239 ns/op
BenchmarkMain/shareshard.multiget-8  	10000000	       123 ns/op
BenchmarkMain/shareshard.multiput-8  	10000000	       147 ns/op
BenchmarkMain/shareshard.multiputget-8         	 5000000	       294 ns/op

and with 8 concurrent Go routines (multi)
BenchmarkMain/shareshard.get-8       	20000000	       114 ns/op
BenchmarkMain/shareshard.put-8       	10000000	       130 ns/op
BenchmarkMain/shareshard.putget-8    	 5000000	       252 ns/op
BenchmarkMain/shareshard.multiget-8  	10000000	       188 ns/op
BenchmarkMain/shareshard.multiput-8  	 5000000	       284 ns/op
BenchmarkMain/shareshard.multiputget-8         	 2000000	       846 ns/op

which are far superior to the standard "lock" and "sync" maps, but still 20% slower in the common read-only case.

Which is why I am going to write a shared intmap that uses CAS at the table-row level... stay tuned.

robaho (Author) commented Nov 26, 2018

The "shared intmap" shows the best true concurrent performance so far:

with 2 Go routines:
BenchmarkMain/sharedint.get-8                  	20000000	        61.4 ns/op
BenchmarkMain/sharedint.put-8                  	10000000	       166 ns/op
BenchmarkMain/sharedint.putget-8               	10000000	       199 ns/op
BenchmarkMain/sharedint.multiget-8             	20000000	        64.0 ns/op
BenchmarkMain/sharedint.multiput-8             	10000000	       170 ns/op
BenchmarkMain/sharedint.multiputget-8          	10000000	       218 ns/op

with 8 Go routines:
BenchmarkMain/sharedint.get-8                  	20000000	        59.0 ns/op
BenchmarkMain/sharedint.put-8                  	10000000	       167 ns/op
BenchmarkMain/sharedint.putget-8               	10000000	       198 ns/op
BenchmarkMain/sharedint.multiget-8             	20000000	       116 ns/op
BenchmarkMain/sharedint.multiput-8             	 5000000	       233 ns/op
BenchmarkMain/sharedint.multiputget-8          	 5000000	       404 ns/op

which suggests that the performance problems with sync.Map are fundamental. I believe the only way to fix them would be to write a much lower-level sync.Map that mimics the internal map but supports concurrency, with behavior similar to Java's ConcurrentHashMap - probably the most often used structure in Java when developing concurrent applications.

The current sync.Map implementation that leverages the existing internal map is not going to work since it was not designed for concurrent access & modification.

Writing your own concurrent maps is also very cumbersome, as the internal "hash code" functionality is not available, which means all hashing code needs to be duplicated.

robaho (Author) commented Nov 26, 2018

Lastly, I wrote a true CAS shared intmap that is fully concurrent (although fixed size, no removals, etc). The performance numbers:
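A rough, illustrative sketch of the kind of fixed-size, CAS-based int map being described (the names, hash, and sizes are assumptions, not the repo's code). One caveat of this simple claim-then-publish scheme: a reader racing with the very first Put of a key can observe a zero value; the thread's benchmarks pre-populate the map, so that window does not arise there.

package main

import "sync/atomic"

const tableSize = 1 << 21 // power of two, comfortably larger than the 2^20 working set

// casIntMap is a fixed-size, open-addressed int64->int64 map with no deletion
// and no resizing; key slots are claimed with CAS. Key 0 is reserved as the
// "empty" marker, so key 0 cannot be stored.
type casIntMap struct {
	keys [tableSize]int64
	vals [tableSize]int64
}

func hash(k int64) uint64 {
	return uint64(k) * 0x9E3779B97F4A7C15 // Fibonacci hashing as a stand-in hash
}

func (m *casIntMap) Put(k, v int64) {
	for i := hash(k); ; i++ { // linear probing; assumes the table never fills up
		slot := i & (tableSize - 1)
		for {
			cur := atomic.LoadInt64(&m.keys[slot])
			if cur == k {
				atomic.StoreInt64(&m.vals[slot], v)
				return
			}
			if cur != 0 {
				break // slot owned by another key: probe the next slot
			}
			if atomic.CompareAndSwapInt64(&m.keys[slot], 0, k) {
				atomic.StoreInt64(&m.vals[slot], v)
				return
			}
			// Lost the CAS: another goroutine just claimed this slot; re-check it.
		}
	}
}

func (m *casIntMap) Get(k int64) (int64, bool) {
	for i := hash(k); ; i++ {
		slot := i & (tableSize - 1)
		switch atomic.LoadInt64(&m.keys[slot]) {
		case k:
			return atomic.LoadInt64(&m.vals[slot]), true
		case 0:
			return 0, false // hit an empty slot: the key was never stored
		}
	}
}

func main() {
	m := new(casIntMap) // ~32 MB of zeroed slots, allocated on the heap
	m.Put(1, 42)
	_, _ = m.Get(1)
}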

with 2 Go routines:
BenchmarkMain/sharedint.get-8         	30000000	        53.5 ns/op
BenchmarkMain/sharedint.put-8         	10000000	       126 ns/op
BenchmarkMain/sharedint.putget-8      	10000000	       198 ns/op
BenchmarkMain/sharedint.multiget-8    	30000000	        56.3 ns/op
BenchmarkMain/sharedint.multiput-8    	10000000	       128 ns/op
BenchmarkMain/sharedint.multiputget-8 	10000000	       208 ns/op

with 8 Go routines
BenchmarkMain/sharedint.get-8         	30000000	        54.1 ns/op
BenchmarkMain/sharedint.put-8         	10000000	       128 ns/op
BenchmarkMain/sharedint.putget-8      	10000000	       202 ns/op
BenchmarkMain/sharedint.multiget-8    	20000000	        97.2 ns/op
BenchmarkMain/sharedint.multiput-8    	10000000	       158 ns/op
BenchmarkMain/sharedint.multiputget-8 	 5000000	       224 ns/op

The performance numbers show what is possible with a truly concurrent map. There is almost no degradation as the concurrency rises (most of what remains can be attributed to the test machine having only 4 real cores plus 4 hyper-threads).

The sync.Map implementation needs to be updated to use a similar methodology, leveraging the internal map/hash support functions.

As it is now, a "sharded lock map" is a better solution than sync.Map for read-heavy use cases, as the "shared intmap" would require lots of custom code for general usage.

For summary analysis, here are the complete timings using 8 goroutines during the "multi" tests:

BenchmarkMain/unshared.get-8         	20000000	        87.5 ns/op
BenchmarkMain/unshared.put-8         	20000000	        97.6 ns/op
BenchmarkMain/unshared.putget-8      	10000000	       179 ns/op
BenchmarkMain/unshared.multiget-8    	10000000	       131 ns/op
BenchmarkMain/lock.get-8             	20000000	       111 ns/op
BenchmarkMain/lock.put-8             	10000000	       125 ns/op
BenchmarkMain/lock.putget-8          	10000000	       235 ns/op
BenchmarkMain/lock.multiget-8        	 5000000	       379 ns/op
BenchmarkMain/lock.multiput-8        	 1000000	      2485 ns/op
BenchmarkMain/lock.multiputget-8     	 1000000	      2401 ns/op
BenchmarkMain/sync.get-8             	 5000000	       294 ns/op
BenchmarkMain/sync.put-8             	 5000000	       317 ns/op
BenchmarkMain/sync.putget-8          	 2000000	       699 ns/op
BenchmarkMain/sync.multiget-8        	 3000000	       504 ns/op
BenchmarkMain/sync.multiput-8        	 3000000	       593 ns/op
BenchmarkMain/sync.multiputget-8     	 1000000	      1035 ns/op
BenchmarkMain/channel.get-8          	 2000000	       955 ns/op
BenchmarkMain/channel.put-8          	 3000000	       564 ns/op
BenchmarkMain/channel.putget-8       	 1000000	      1460 ns/op
BenchmarkMain/channel.multiget-8     	  200000	      6345 ns/op
BenchmarkMain/channel.multiput-8     	  300000	      4755 ns/op
BenchmarkMain/channel.multiputget-8  	  200000	      9575 ns/op
BenchmarkMain/shard.get-8            	20000000	       100 ns/op
BenchmarkMain/shard.put-8            	20000000	       101 ns/op
BenchmarkMain/shard.putget-8         	10000000	       197 ns/op
BenchmarkMain/shard.multiget-8       	10000000	       146 ns/op
BenchmarkMain/shareshard.get-8       	20000000	       112 ns/op
BenchmarkMain/shareshard.put-8       	10000000	       128 ns/op
BenchmarkMain/shareshard.putget-8    	 5000000	       241 ns/op
BenchmarkMain/shareshard.multiget-8  	10000000	       178 ns/op
BenchmarkMain/shareshard.multiput-8  	 5000000	       276 ns/op
BenchmarkMain/shareshard.multiputget-8         	 2000000	       879 ns/op
BenchmarkMain/intmap.get-8                     	20000000	       105 ns/op
BenchmarkMain/intmap.put-8                     	10000000	       212 ns/op
BenchmarkMain/intmap.putget-8                  	 5000000	       299 ns/op
BenchmarkMain/intmap.multiget-8                	10000000	       183 ns/op
BenchmarkMain/intmap.multiput-8                	 5000000	       258 ns/op
BenchmarkMain/intmap.multiputget-8             	 3000000	       456 ns/op
BenchmarkMain/intmap2.get-8                    	30000000	        53.7 ns/op
BenchmarkMain/intmap2.put-8                    	10000000	       130 ns/op
BenchmarkMain/intmap2.putget-8                 	10000000	       200 ns/op
BenchmarkMain/intmap2.multiget-8               	20000000	        99.4 ns/op
BenchmarkMain/intmap2.multiput-8               	10000000	       151 ns/op
BenchmarkMain/intmap2.multiputget-8            	10000000	       228 ns/op
BenchmarkMain/sharedint.get-8                  	30000000	        54.5 ns/op
BenchmarkMain/sharedint.put-8                  	10000000	       127 ns/op
BenchmarkMain/sharedint.putget-8               	10000000	       206 ns/op
BenchmarkMain/sharedint.multiget-8             	20000000	        96.1 ns/op
BenchmarkMain/sharedint.multiput-8             	10000000	       152 ns/op
BenchmarkMain/sharedint.multiputget-8          	10000000	       221 ns/op

robaho (Author) commented Nov 26, 2018

I don’t really understand what thumbs down means in this context. Why is that even an option on bug reports? If you disagree with the conclusions then offer an alternative. I stand by them. The only way to fix sync.Map requires a different data structure with reuse of the standard map internals.

bcmills (Contributor) commented Nov 26, 2018

The performance of sync.Map for reads should be nearly equal to the standard built-in map. The is the demonstrated behavior with the similar Java ConcurrentHashMap.

Parity with the built-in map wouldn't tell us much: would it mean that sync.Map is sufficiently fast, or that the built-in map is much too slow?

RLH (Contributor) commented Nov 26, 2018

Have you run benchstat on these numbers? It will help.

robaho (Author) commented Nov 26, 2018

@bcmills I am assuming that the built-in map has been sufficiently tuned, given its pervasiveness, that parity with it would be the goal. @RLH not sure what benchstat would offer. The results have been reproduced by @as. They are valid.

bcmills (Contributor) commented Nov 26, 2018

The sync.Map implementation needs to be updated to use a similar methodology, leveraging the internal map/hash support functions.

That's #21031.

Also note that your benchmarks add one more layer of pointer indirection when using sync.Map, but not when using built-in maps: the built-in map operates on typed values (and you are using int values for your benchmarks), but sync.Map operates on interface{} keys and values, which imposes an additional allocation and pointer indirection. A more efficient solution should avoid interface overhead, but — without making sync.Map a built-in type — that would require language support for generic containers (#15292).
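A tiny, hypothetical illustration of the extra interface{} hop described here (not code from the benchmarks themselves):

package main

import "sync"

var typed = map[int]int{42: 1} // values stored inline; a lookup returns the int directly

func lookups() (int, int) {
	var boxed sync.Map
	boxed.Store(42, 1) // key and value are boxed into interface{} values here

	a := typed[42]         // one map lookup, no interface conversion
	v, _ := boxed.Load(42) // an interface{} comes back...
	return a, v.(int)      // ...and needs a type assertion (one more pointer chase) to yield the int
}

func main() { _, _ = lookups() }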

bcmills (Contributor) commented Nov 26, 2018

Also note that much of the overhead that sync.Map avoids comes from cache contention, which can make individual operations O(N) with the number of physical cores: with N cores executing those operations concurrently, you can get O(N²) behavior.

N=4, as you are using for benchmarking, is low enough that the constant factors likely still dominate.

(The original sync.Map implementation was benchmarked on a machine with 24 physical cores, and the tradeoff point for many of the concrete call sites occurred at around 8 cores.)

bcmills (Contributor) commented Nov 26, 2018

Writing your own concurrent maps is also very cumbersome, as the internal "hash code" functionality is not available, which means all hashing code needs to be duplicated.

See #21195.

bcmills (Contributor) commented Nov 26, 2018

As it is now, a "sharded lock map" is a better solution than sync.Map for read-heavy use cases,

Sharding is one of several possible solutions proposed in #21035.

robaho (Author) commented Nov 26, 2018

@bcmills thanks for your input, it seems the Go team has a good handle on the issues. I’m not sure about the cache contention claim - I tested with 8x - and I’m more concerned with the “read only” workload, but that is minor.

My bad for not finding these issues before filing this - I just figured that for such an important, common data structure a lot of these would have already been addressed, so I was looking for more esoteric causes.

I would suggest that the internal Go map benchmarks be enhanced to test larger maps where the cache locality and number of indirections would surface as a performance problem as shown here.

robaho (Author) commented Nov 26, 2018

I guess this can be closed as a duplicate given @bcmills' references. Given the lack of generics, I think the cited issues should be given higher priority - this is just too important a data structure to exhibit such poor performance.

bcmills (Contributor) commented Nov 26, 2018

Thanks. Closing as a duplicate of #21031, #21035, and #15292.

If you're interested in concurrent data structures, you're certainly welcome to take a run at those.

(Personally, I use sync.Map so infrequently that I still have to look up the API every time, even though I implemented it. It's an important use-case when it does occur, but it doesn't seem to occur all that frequently in Go programs.)

bcmills closed this as completed on Nov 26, 2018
robaho (Author) commented Nov 26, 2018

@bcmills as I mentioned earlier, large "mainly read" shared maps are an important structure in many financial applications.

As a final close to this issue, I'll report the standard cases - an idiomatic lock-protected map in Go, sync.Map, and Java's ConcurrentHashMap - with 6x concurrency for the multi tests and a 2^20 map with full-range access:

BenchmarkMain/lock.get-8         	20000000	       110 ns/op
BenchmarkMain/lock.put-8         	10000000	       129 ns/op
BenchmarkMain/lock.putget-8      	 5000000	       240 ns/op
BenchmarkMain/lock.multiget-8    	 5000000	       328 ns/op
BenchmarkMain/lock.multiput-8    	 1000000	      1549 ns/op
BenchmarkMain/lock.multiputget-8 	 1000000	      1819 ns/op
BenchmarkMain/sync.get-8         	 5000000	       287 ns/op
BenchmarkMain/sync.put-8         	 5000000	       327 ns/op
BenchmarkMain/sync.putget-8      	 2000000	       706 ns/op
BenchmarkMain/sync.multiget-8    	 3000000	       486 ns/op
BenchmarkMain/sync.multiput-8    	 3000000	       535 ns/op
BenchmarkMain/sync.multiputget-8 	 2000000	       957 ns/op

Benchmark                            (arg)  Mode  Cnt    Score    Error  Units
TestJavaCache.Test0Get          concurrent  avgt    5   72.301 ±  2.706  ns/op
TestJavaCache.Test2Put          concurrent  avgt    5  168.011 ± 12.797  ns/op
TestJavaCache.Test3PutGet       concurrent  avgt    5  293.249 ± 31.050  ns/op
TestJavaCache.Test4MultiGet     concurrent  avgt    5  121.128 ±  5.543  ns/op
TestJavaCache.Test5MultiPut     concurrent  avgt    5  261.543 ± 14.179  ns/op
TestJavaCache.Test6MultiPutGet  concurrent  avgt    5  497.539 ± 34.685  ns/op
