Parallel Writes to mem tables, etc #642
Comments
Let me report some status here. I built a prototype of a parallel writer for memtable inserts. The basic workflow is:
Some initial results. Here is a sample benchmark command I ran: ./db_bench -benchmarks=fillrandom -threads=16 -batch_size=64 -memtablerep="lock_free_skip_list" -value_size=0 --num=3000000 -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8

Here is the observation. With -batch_size=1 (each write request updates only one key), multi-threaded insertion into the memtable does not help at all. With -batch_size=1, the normal single-writer skip list reaches about 392K ops/s, or 6.0 MB/s, while 16 threads do even worse: 136K ops/s, 2.1 MB/s. The likely reason is the communication cost between those threads, such as maintaining the waiting queue and waiting for the leader's instruction. I'm not sure how much we can optimize at the code level.

On the other hand, with a batch size of 8 or more, we can clearly see the total ingestion rate improve as more threads are used (although not linearly). With -batch_size=64, 32 threads reach 1.75M ops/s at 26.8 MB/s, while the single-threaded skip list only reaches 855K ops/s at 13.0 MB/s. It is worth noting that the lock-free skip list in single-writer mode only reaches 642K ops/s at 9.8 MB/s.

Finally, @guyg8, here is a question for you: I saw that neither the lock-free skip list nor the concurrent arena passes the ThreadSanitizer checks. We usually keep our code base clean under ThreadSanitizer. I'm not sure whether these are false positives or real data races. But usually, even for a false positive, we rewrite the code so that ThreadSanitizer can understand it is correct. Can you try to fix that? When you build RocksDB from a clean tree, you can set COMPILE_WITH_TSAN to enable it.
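To make the comparison above easier to read, here is a small sketch that turns the reported throughput numbers into ratios. The function names Speedup and PrintReportedSpeedups are made up for this illustration and are not part of db_bench or RocksDB; the inputs are only the ops/s figures quoted in the comment.

```cpp
#include <cassert>
#include <cstdio>

// Ratio of a candidate throughput to a baseline throughput (both in ops/s).
// A result below 1.0 means the candidate was slower than the baseline.
// Hypothetical helper, written only for this illustration.
double Speedup(double candidate_ops_per_sec, double baseline_ops_per_sec) {
  assert(baseline_ops_per_sec > 0.0);
  return candidate_ops_per_sec / baseline_ops_per_sec;
}

// Prints the ratios implied by the numbers reported in the comment above.
void PrintReportedSpeedups() {
  // -batch_size=1: 16 threads (136K ops/s) vs. single-writer skip list (392K ops/s).
  std::printf("batch_size=1, 16 threads vs single writer: %.2fx\n",
              Speedup(136e3, 392e3));
  // -batch_size=64: 32 threads (1.75M ops/s) vs. single-threaded skip list (855K ops/s).
  std::printf("batch_size=64, 32 threads vs single writer: %.2fx\n",
              Speedup(1.75e6, 855e3));
  // Lock-free skip list in single-writer mode (642K ops/s) vs. skip list (855K ops/s).
  std::printf("lock-free skip list, single writer, vs skip list: %.2fx\n",
              Speedup(642e3, 855e3));
}
```

Calling PrintReportedSpeedups() from a main function gives roughly 0.35x, 2.05x, and 0.75x. In other words, the parallel path costs about a factor of three at batch size 1 and gains about a factor of two at batch size 64.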
ThreadSanitizer's algorithm is designed for blocking synchronization (such as locks), because its happens-before relation is similar to the one in http://dl.acm.org/citation.cfm?id=1542490. The lock-free skip list is designed to work on weak memory models (including the C++11 memory model), so this should not be a problem. I still want to make sure that the concurrent arena also works on weak memory models (I don't remember the code); I'll do that early next week.
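As a point of reference for what "works on the C++11 memory model" usually means in practice, here is a minimal sketch of release/acquire publication. It is not taken from the lock-free skip list or the concurrent arena; PublishAndRead is a hypothetical function written only to show the pattern, in which plain data is written first and then made visible through an atomic flag.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Writes a plain (non-atomic) payload, publishes it with a release store,
// and returns the value a second thread reads after an acquire load.
// Hypothetical helper for illustration only.
int PublishAndRead(int payload_value) {
  int payload = 0;                  // ordinary, non-atomic data
  std::atomic<bool> ready{false};   // publication flag
  int observed = -1;

  std::thread reader([&] {
    // The acquire load pairs with the release store below. Once it sees
    // true, the earlier write to payload is guaranteed to be visible.
    while (!ready.load(std::memory_order_acquire)) {
      std::this_thread::yield();
    }
    observed = payload;
  });

  payload = payload_value;                        // 1. write the data
  ready.store(true, std::memory_order_release);   // 2. publish it

  reader.join();  // join() orders the read of observed after the reader's write
  return observed;
}
```

With these orderings the program has no data race under the C++11 definition, even on hardware with a weak memory model. If both accesses to ready used memory_order_relaxed instead, the read of payload would no longer be ordered after the write.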
@guyg8, from my understanding of TSAN, if you use atomic objects, TSAN will not complain in many cases. We do have low-lock code in the current code base (search for compare_exchange, for example), and we make it pass TSAN too. You can give it a try.
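For readers unfamiliar with the pattern being discussed, here is a minimal sketch of low-lock code built on std::atomic and compare_exchange. It is not RocksDB code: PushOnlyList and ConcurrentPushCount are hypothetical names, and the structure is a simple push-only linked list rather than a skip list. Because every access to the shared head pointer goes through a std::atomic, the concurrent pushes do not form a data race in the C++11 sense.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct Node {
  int value;
  Node* next;
};

// Push-only singly linked list; hypothetical class for illustration.
// Nodes are never removed concurrently, so the ABA problem does not arise.
class PushOnlyList {
 public:
  ~PushOnlyList() {
    // Assumed to run after all writer threads have been joined.
    Node* p = head_.load(std::memory_order_acquire);
    while (p != nullptr) {
      Node* next = p->next;
      delete p;
      p = next;
    }
  }

  void Push(int value) {
    Node* node = new Node{value, head_.load(std::memory_order_relaxed)};
    // On failure, compare_exchange_weak stores the current head into
    // node->next, so the loop simply retries with the fresh value.
    while (!head_.compare_exchange_weak(node->next, node,
                                        std::memory_order_release,
                                        std::memory_order_relaxed)) {
    }
  }

  std::size_t Size() const {
    std::size_t n = 0;
    for (Node* p = head_.load(std::memory_order_acquire); p != nullptr;
         p = p->next) {
      ++n;
    }
    return n;
  }

 private:
  std::atomic<Node*> head_{nullptr};
};

// Pushes from several threads at once and returns the final node count.
std::size_t ConcurrentPushCount(int num_threads, int pushes_per_thread) {
  assert(num_threads > 0 && pushes_per_thread >= 0);
  PushOnlyList list;
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&list, pushes_per_thread] {
      for (int i = 0; i < pushes_per_thread; ++i) list.Push(i);
    });
  }
  for (auto& th : threads) th.join();
  return list.Size();
}
```

For example, ConcurrentPushCount(4, 1000) should return 4000, since no push can be lost: a failed compare-exchange only causes a retry. Whether a given TSAN version accepts a particular lock-free structure still has to be checked by running it.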
@guyg8, another question. I want to make sure I did this correctly. I compiled the branch and ran: ./db_bench -benchmarks=fillrandom -threads=32 -batch_size=1 -memtablerep="lock_free_skip_list" -value_size=0 --num=1000000 -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal -concurrent_writes --write_buffer_size=160000000

I only see a throughput of 798K ops/s, 12.2 MB/s. Despite running 32 threads, that is only 2.7 times faster than RocksDB master with the single-threaded skip list: ./db_bench -benchmarks=fillrandom -threads=1 -batch_size=1 -memtablerep="skip_list" -value_size=0 --num=1000000 -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000

which reaches 301K ops/s, 4.6 MB/s. If I further shrink the write buffer size to 4MB, it can be further improved to 475 ops/s (

Is this performance expected, or did I run it incorrectly?
TSAN reports on data races that are identified at runtime. Any program with “X.compare_exchange” (in which "X.compare_exchange" is really needed) contains a data race. In particular, it is easy to find data-races in the master branch of rocksdb.

-bash-4.1$ ./db_bench --benchmarks=readwhilewriting,noOne --num=2 --threads=8 --duration=20 -memtablerep="skip_list"
WARNING: Assertions are enabled; benchmarks unnecessarily slow
DB path: [/tmp/rocksdbtest-80289/dbbench]
Previous write of size 1 at 0x7d560000e41a by thread T1:
Location is heap block of size 2616 at 0x7d560000dc00 allocated by main thread:
Thread T5 (tid=25441, running) created by main thread at:
Thread T1 (tid=25437, running) created by main thread at:
SUMMARY: ThreadSanitizer: data race /home/ggolan/rocksdb_master28_6_15/rocksdb/./util/coding.h:109 rocksdb::GetVarint32Ptr(char const*, char const*, unsigned int*)
==================
Previous write of size 1 at 0x7d560000e42a by thread T1:
Location is heap block of size 2616 at 0x7d560000dc00 allocated by main thread:
Thread T7 (tid=25443, running) created by main thread at:
Thread T1 (tid=25437, running) created by main thread at:
SUMMARY: ThreadSanitizer: data race /home/ggolan/rocksdb_master28_6_15/rocksdb/./include/rocksdb/slice.h:132 rocksdb::Slice::compare(rocksdb::Slice const&) const
WARNING: ThreadSanitizer: data race (pid=25436)
Previous write of size 8 at 0x7fa2b870c7e8 by thread T1:
Location is heap block of size 419432 at 0x7fa2b8700000 allocated by thread T1:
Thread T8 (tid=25444, running) created by main thread at:
Thread T1 (tid=25437, running) created by main thread at:
SUMMARY: ThreadSanitizer: data race /home/ggolan/rocksdb_master28_6_15/rocksdb/./db/skiplist.h:278 rocksdb::SkipList<char const*, rocksdb::MemTableRep::KeyComparator const&>::KeyIsAfterNode(char const* const&, rocksdb::SkipList<char const*, rocksdb::MemTableRep::KeyComparator const&>::Node*) const
==================
Previous write of size 1 at 0x7fa2b8711f68 by thread T1:
Location is heap block of size 419432 at 0x7fa2b8700000 allocated by thread T1:
Thread T8 (tid=25444, running) created by main thread at:
Thread T1 (tid=25437, running) created by main thread at:
SUMMARY: ThreadSanitizer: data race /home/ggolan/rocksdb_master28_6_15/rocksdb/./include/rocksdb/slice.h:132 rocksdb::Slice::compare(rocksdb::Slice const&) const
WARNING: ThreadSanitizer: data race (pid=25436)
Previous write of size 8 at 0x7fa2b873fea8 by thread T1:
Location is heap block of size 419432 at 0x7fa2b8700000 allocated by thread T1:
Thread T7 (tid=25443, running) created by main thread at:
Thread T1 (tid=25437, running) created by main thread at:
SUMMARY: ThreadSanitizer: data race /home/ggolan/rocksdb_master28_6_15/rocksdb/./util/coding.h:109 rocksdb::GetVarint32Ptr(char const*, char const*, unsigned int*)
As far as I understand, TSAN should be used to find unexpected data-races. Notice that (almost) every multi-threaded program has some data-races (e.g., the implementation of the mutex itself contains data-races). Regarding class “ConcurrentArena” and class “Arena”. Regarding the question about performance.
Sent a code review: https://reviews.facebook.net/D41373
Closing this via automation due to lack of activity. If discussion is still needed here, please re-open or create a new/updated issue.
Let's use this issue to track the efforts to improve write throughput to mem tables.
As discussed with @guyg8, I'll work on branch sdong_write and port the necessary changes from branch write_throughput, which contains the changes @guyg8 made. Hopefully branch sdong_write will have running code in a week, and then we can see whether there are any improvements.