Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate the performance of the history replay tool #5707

Open
bowenwang1996 opened this issue Dec 8, 2021 · 16 comments
Open

Investigate the performance of the history replay tool #5707

bowenwang1996 opened this issue Dec 8, 2021 · 16 comments
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. Groomed Node Node team P-low Priority: low T-node Team: issues relevant to the node experience team

Comments

@bowenwang1996
Copy link
Collaborator

bowenwang1996 commented Dec 8, 2021

According to #5697 (comment), we can only play back 10 blocks per second even with multiple threads. This does not align with our expectations and it blocks the contract runtime team from validating the wasmer2 and wasmer0 give the exact same results on the entire mainnet history, which would allow us to completely remove wasmer0 support.

@olonho did some initial investigation and found that we spend a suspicious amount of time in kernel (see the profile below) ![image](https://user-images.githubusercontent.com/13601726/145128257-4b3de305-1c1f-4f02-84f0-cd857eff4020.png)

@bowenwang1996 bowenwang1996 added C-enhancement Category: An issue proposing an enhancement or a PR with one. T-node Team: issues relevant to the node experience team labels Dec 8, 2021
@bowenwang1996
Copy link
Collaborator Author

@olonho do you want to investigate this yourself?

@olonho
Copy link
Contributor

olonho commented Dec 8, 2021

I'm looking on that, but second pair of eyes wouldn't hurt.

@olonho
Copy link
Contributor

olonho commented Dec 8, 2021

Curious stats from about a day of playback:

Processed 10000 blocks, 9 blocks per second, 4357195 secs remaining
Processed 20000 blocks, 10 blocks per second, 3920475 secs remaining
Processed 30000 blocks, 12 blocks per second, 3266229 secs remaining
Processed 40000 blocks, 12 blocks per second, 3265396 secs remaining
Processed 50000 blocks, 13 blocks per second, 3013442 secs remaining
Processed 60000 blocks, 9 blocks per second, 4351639 secs remaining
Processed 70000 blocks, 9 blocks per second, 4350528 secs remaining
Processed 80000 blocks, 10 blocks per second, 3914475 secs remaining
Processed 90000 blocks, 9 blocks per second, 4348306 secs remaining
Processed 100000 blocks, 9 blocks per second, 4347195 secs remaining
Processed 110000 blocks, 9 blocks per second, 4346084 secs remaining
Processed 120000 blocks, 9 blocks per second, 4344973 secs remaining
Processed 130000 blocks, 9 blocks per second, 4343861 secs remaining
Processed 140000 blocks, 11 blocks per second, 3553159 secs remaining
Processed 150000 blocks, 9 blocks per second, 4341639 secs remaining
Processed 160000 blocks, 11 blocks per second, 3551341 secs remaining
Processed 170000 blocks, 10 blocks per second, 3905475 secs remaining
Processed 180000 blocks, 10 blocks per second, 3904475 secs remaining
Processed 190000 blocks, 10 blocks per second, 3903475 secs remaining
Processed 200000 blocks, 10 blocks per second, 3902475 secs remaining
Processed 210000 blocks, 11 blocks per second, 3546796 secs remaining
Processed 220000 blocks, 10 blocks per second, 3900475 secs remaining
Processed 230000 blocks, 10 blocks per second, 3899475 secs remaining
Processed 240000 blocks, 8 blocks per second, 4873094 secs remaining
Processed 250000 blocks, 9 blocks per second, 4330528 secs remaining
Processed 260000 blocks, 13 blocks per second, 2997289 secs remaining
Processed 270000 blocks, 23 blocks per second, 1693685 secs remaining
Processed 280000 blocks, 34 blocks per second, 1145434 secs remaining
Processed 290000 blocks, 32 blocks per second, 1216711 secs remaining
Processed 300000 blocks, 15 blocks per second, 2594983 secs remaining
Processed 310000 blocks, 10 blocks per second, 3891475 secs remaining
Processed 320000 blocks, 9 blocks per second, 4322750 secs remaining
Processed 330000 blocks, 8 blocks per second, 4861844 secs remaining
Processed 340000 blocks, 7 blocks per second, 5554965 secs remaining
Processed 350000 blocks, 8 blocks per second, 4859344 secs remaining
Processed 360000 blocks, 8 blocks per second, 4858094 secs remaining
Processed 370000 blocks, 7 blocks per second, 5550679 secs remaining
Processed 380000 blocks, 11 blocks per second, 3531341 secs remaining
Processed 390000 blocks, 29 blocks per second, 1339129 secs remaining
Processed 400000 blocks, 30 blocks per second, 1294158 secs remaining
Processed 410000 blocks, 30 blocks per second, 1293825 secs remaining
Processed 420000 blocks, 28 blocks per second, 1385884 secs remaining
Processed 430000 blocks, 20 blocks per second, 1939737 secs remaining
Processed 440000 blocks, 9 blocks per second, 4309417 secs remaining
Processed 450000 blocks, 10 blocks per second, 3877475 secs remaining
Processed 460000 blocks, 9 blocks per second, 4307195 secs remaining
Processed 470000 blocks, 8 blocks per second, 4844344 secs remaining
Processed 480000 blocks, 9 blocks per second, 4304973 secs remaining
Processed 490000 blocks, 9 blocks per second, 4303861 secs remaining
Processed 500000 blocks, 10 blocks per second, 3872475 secs remaining
Processed 510000 blocks, 9 blocks per second, 4301639 secs remaining
Processed 520000 blocks, 10 blocks per second, 3870475 secs remaining
Processed 530000 blocks, 10 blocks per second, 3869475 secs remaining
Processed 540000 blocks, 11 blocks per second, 3516796 secs remaining
Processed 550000 blocks, 10 blocks per second, 3867475 secs remaining
Processed 560000 blocks, 10 blocks per second, 3866475 secs remaining
Processed 570000 blocks, 10 blocks per second, 3865475 secs remaining
Processed 580000 blocks, 10 blocks per second, 3864475 secs remaining
Processed 590000 blocks, 10 blocks per second, 3863475 secs remaining
Processed 600000 blocks, 10 blocks per second, 3862475 secs remaining
Processed 610000 blocks, 10 blocks per second, 3861475 secs remaining
Processed 620000 blocks, 10 blocks per second, 3860475 secs remaining
Processed 630000 blocks, 9 blocks per second, 4288306 secs remaining
Processed 640000 blocks, 12 blocks per second, 3215396 secs remaining
Processed 650000 blocks, 12 blocks per second, 3214563 secs remaining
Processed 660000 blocks, 10 blocks per second, 3856475 secs remaining
Processed 670000 blocks, 9 blocks per second, 4283861 secs remaining
Processed 680000 blocks, 13 blocks per second, 2964981 secs remaining
Processed 690000 blocks, 10 blocks per second, 3853475 secs remaining
Processed 700000 blocks, 10 blocks per second, 3852475 secs remaining
Processed 710000 blocks, 10 blocks per second, 3851475 secs remaining
Processed 720000 blocks, 10 blocks per second, 3850475 secs remaining
Processed 730000 blocks, 11 blocks per second, 3499523 secs remaining
Processed 740000 blocks, 11 blocks per second, 3498614 secs remaining
Processed 750000 blocks, 11 blocks per second, 3497705 secs remaining
Processed 760000 blocks, 12 blocks per second, 3205396 secs remaining
Processed 770000 blocks, 10 blocks per second, 3845475 secs remaining
Processed 780000 blocks, 10 blocks per second, 3844475 secs remaining
Processed 790000 blocks, 10 blocks per second, 3843475 secs remaining
Processed 800000 blocks, 12 blocks per second, 3202063 secs remaining
Processed 810000 blocks, 11 blocks per second, 3492250 secs remaining
Processed 820000 blocks, 10 blocks per second, 3840475 secs remaining
Processed 830000 blocks, 11 blocks per second, 3490432 secs remaining
Processed 840000 blocks, 11 blocks per second, 3489523 secs remaining
Processed 850000 blocks, 11 blocks per second, 3488614 secs remaining
Processed 860000 blocks, 10 blocks per second, 3836475 secs remaining
Processed 870000 blocks, 10 blocks per second, 3835475 secs remaining
Processed 880000 blocks, 11 blocks per second, 3485887 secs remaining
Processed 890000 blocks, 11 blocks per second, 3484977 secs remaining
Processed 900000 blocks, 11 blocks per second, 3484068 secs remaining
Processed 910000 blocks, 10 blocks per second, 3831475 secs remaining
Processed 920000 blocks, 10 blocks per second, 3830475 secs remaining

So we see that playback speed is not uniform and sometimes reaches 305 blocks per second.

@olonho
Copy link
Contributor

olonho commented Dec 8, 2021

Pure neard top looks like this:

[.] rocksdb::crc32c::crc32c_3way                                                                                ◆
[.] snappy::RawUncompress                                                                                       ▒
[.] std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std▒
[.] sha2::sha256::compress256                                                                                   ▒
[.] rocksdb::IndexBlockIter::SeekImpl                                                                           ▒
[.] rocksdb::ReadProperties                                                                                     ▒
[.] rocksdb::LRUHandleTable::FindPointer                                                                        ▒
[.] rocksdb::LRUCacheShard::LRU_Remove                                                                          ▒
[.] std::_Sp_counted_ptr<rocksdb::TableProperties*, (__gnu_cxx::_Lock_policy)2>::_M_dispose                     ▒
[.] rocksdb::BlockBasedTable::~BlockBasedTable                                                                  ▒
[.] rocksdb::BlockFetcher::ReadBlockContents                                                                    ▒
[.] _rjem_sdallocx                                                                                              ▒
[.] <FN as wasmer_runtime_core_near::typed_func::HostFunction<wasmer_runtime_core_near::typed_func::ExplicitVmCt▒
[.] rocksdb::LRUCacheShard::LRU_Insert                                                                          ▒
[.] _rjem_malloc                                                                                                ▒
[.] std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std▒
[.] rocksdb::Version::Get                                                                                       ▒
[.] rocksdb::ShardedCache::Lookup                                                                               ▒
[.] rocksdb::TableCache::Get                                                                                    ▒
[.] std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<s▒
[.] rocksdb::BlockBasedTable::Get                                                                               ▒
[.] bs58::encode::encode_into                                                                                   ▒
[.] std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std▒
[.] rocksdb::BlockBasedFilterBlockReader::ParseFieldsFromBlock                                                  ▒
[.] rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>                                      ▒
[.] rocksdb::DataBlockIter::SeekImpl                                                                            ▒
[.] std::__lower_bound<rocksdb::FdWithKeyRange*, rocksdb::Slice, __gnu_cxx::__ops::_Iter_comp_val<rocksdb::(anon▒
[.] <alloc::vec::Vec<T,A> as core::clone::Clone>::clone                                                         ▒
[.] rocksdb::BlockBasedFilterBlockReader::MayMatch                                     

@olonho
Copy link
Contributor

olonho commented Dec 8, 2021

Kernel top looks like this

32.39%  [k] copy_user_enhanced_fast_string                                                                    ◆
   3.08%  [k] find_get_pages_contig                                                                             ▒
   2.66%  [k] __lock_text_start                                                                                 ▒
   2.51%  [k] generic_file_buffered_read                                                                        ▒
   1.01%  [k] finish_task_switch                                                                                ▒
   0.92%  [k] __softirqentry_text_start                                                                         ▒
   0.61%  [k] iowrite16                                                                                         ▒
   0.47%  [k] copy_page_to_iter                                                                                 ▒
   0.34%  [k] xas_load                                                                                          ▒
   0.34%  [k] syscall_enter_from_user_mode                                                                      ▒
   0.26%  [k] __add_to_page_cache_locked                                                                        ▒
   0.25%  [k] copyout                                                                                           ▒
   0.23%  [k] mark_page_accessed                                                                                ▒
   0.21%  [k] kmem_cache_alloc                                                                                  ▒
   0.21%  [k] kmem_cache_free                                    

copy_user_enhanced_fast_string looks fishy, as it's just this guy https://elixir.bootlin.com/linux/v4.2/source/arch/x86/lib/copy_user_64.S#L216
and not so many system calls ever pass arguments as zero-terminated strings at all.

@olonho
Copy link
Contributor

olonho commented Dec 8, 2021

strace -fp <neard-pid> shows

[pid 371451] readahead(292, 67685523, 965 <unfinished ...>
[pid 371453] fstat(387,  <unfinished ...>
[pid 371451] <... readahead resumed>)   = 0
[pid 371453] <... fstat resumed>{st_mode=S_IFREG|0644, st_size=48871308, ...}) = 0
[pid 371451] pread64(292,  <unfinished ...>
[pid 371453] ioctl(387, FS_IOC_GETVERSION <unfinished ...>
[pid 371451] <... pread64 resumed>"\1\314\237\243 R\303\270\237 \315\340\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 67686435) = 53
[pid 371453] <... ioctl resumed>, 0x7f38203f7cb8) = 0
[pid 371451] fstat(292,  <unfinished ...>
[pid 371453] pread64(387,  <unfinished ...>
[pid 371451] <... fstat resumed>{st_mode=S_IFREG|0644, st_size=67686488, ...}) = 0
[pid 371453] <... pread64 resumed>"\0!\7filter.rocksdb.BuiltinBloomFi"..., 87, 48871168) = 87
[pid 371447] <... pread64 resumed>"\273~\230\0(\332\1ZST\343\211'{\10\235B\231\177\177\234A\224\22:\203r\374UJ\v\320"..., 13028, 50678818) = 13028
[pid 371453] pread64(387,  <unfinished ...>
[pid 371451] ioctl(292, FS_IOC_GETVERSION <unfinished ...>
[pid 371453] <... pread64 resumed>"\0$\4rocksdb.block.based.table.ind"..., 814, 48870354) = 814
[pid 371451] <... ioctl resumed>, 0x7f3821870ac8) = 0
[pid 371447] futex(0x7f3824c4119c, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 371451] pread64(292,  <unfinished ...>
[pid 371452] <... futex resumed>)       = 0
[pid 371451] <... pread64 resumed>"\0!\7filter.rocksdb.BuiltinBloomFi"..., 87, 67686348) = 87
[pid 371447] <... futex resumed>)       = 1
[pid 371453] close(307 <unfinished ...>
[pid 371452] pread64(145,  <unfinished ...>
[pid 371451] pread64(292,  <unfinished ...>
[pid 371447] futex(0x7f3824c4119c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 371453] <... close resumed>)       = 0
[pid 371451] <... pread64 resumed>"\0$\4rocksdb.block.based.table.ind"..., 823, 67685525) = 823
[pid 371453] openat(AT_FDCWD, "/home/ubuntu/.near/data/3124467.sst", O_RDONLY|O_CLOEXEC) = 307
[pid 371453] fcntl(307, F_GETFD <unfinished ...>
[pid 371451] pread64(292,  <unfinished ...>
[pid 371453] <... fcntl resumed>)       = 0x1 (flags FD_CLOEXEC)
[pid 371453] fcntl(307, F_SETFD, FD_CLOEXEC) = 0
[pid 371452] <... pread64 resumed>"\320\335\265\3532\242z\310\340\241\322\v\211\7\26h\276\33-H-\4j\2321!\200\264Y\206\202\315"..., 624243, 67114489) = 624243
[pid 371451] <... pread64 resumed>"\301\215\6\10\0\r\0\r\1$\204\233\256G\250\0\276\177\0\f\r\21\1\22\f\336\303\177\2512\22\0"..., 61522, 67624003) = 61522
[pid 371453] fadvise64(307, 0, 0, POSIX_FADV_RANDOM) = 0
[pid 371453] readahead(307, 67326171, 955) = 0
[pid 371452] pread64(303,  <unfinished ...>
[pid 371453] pread64(307,  <unfinished ...>
[pid 371451] pread64(292,  <unfinished ...>
[pid 371452] <... pread64 resumed>"\0\3o\0342\0\206_\0\3o\35F\213_\320e\0\4o\0364m\340\304\1\304c\0\3o\37"..., 80904, 67663977) = 80904
[pid 371453] <... pread64 resumed>"\1\212\250\215 R\235\312\213 \271\327\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 67327073) = 53
[pid 371453] fstat(307,  <unfinished ...>
[pid 371452] pread64(303,  <unfinished ...>
[pid 371453] <... fstat resumed>{st_mode=S_IFREG|0644, st_size=67327126, ...}) = 0
[pid 371453] ioctl(307, FS_IOC_GETVERSION <unfinished ...>
[pid 371451] <... pread64 resumed>"\206Cg\347\232V\317!\nuQ\241\2032\30,E6\366\350\320\10xb#\32\213\215\226\244v4"..., 499904, 67124099) = 499904
[pid 371453] <... ioctl resumed>, 0x7f38203f7cb8) = 0
[pid 371453] pread64(307,  <unfinished ...>
[pid 371452] <... pread64 resumed>"\25\207t\251\366\264m\276\240\215!\225\345\207rnV\"\350\202\241\364r\270@\372\206A\30,@\177"..., 553720, 67110257) = 553720
[pid 371453] <... pread64 resumed>"\0!\7filter.rocksdb.BuiltinBloomFi"..., 87, 67326986) = 87
[pid 371451] close(296)                 = 0
[pid 371452] openat(AT_FDCWD, "/home/ubuntu/.near/data/3168026.sst", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 371451] pread64(292,  <unfinished ...>
[pid 371453] pread64(307,  <unfinished ...>
[pid 371452] <... openat resumed>)      = 296
[pid 371453] <... pread64 resumed>"\0$\4rocksdb.block.based.table.ind"..., 815, 67326171) = 815
[pid 371452] fcntl(296, F_GETFD)        = 0x1 (flags FD_CLOEXEC)
[pid 371452] fcntl(296, F_SETFD, FD_CLOEXEC <unfinished ...>
[pid 371453] pread64(307,  <unfinished ...>
[pid 371452] <... fcntl resumed>)       = 0
[pid 371453] <... pread64 resumed>"\0\4\202\230+<\0\315F\0\4\202\230>Y\322F\325E\0\4\202\230F\v\254\214\1\267\363\4\0"..., 27582, 67298589) = 27582
[pid 371452] fadvise64(296, 0, 0, POSIX_FADV_RANDOM <unfinished ...>
[pid 371453] pread64(307,  <unfinished ...>
[pid 371452] <... fadvise64 resumed>)   = 0
[pid 371452] readahead(296, 67208543, 524288 <unfinished ...>
[pid 371453] <... pread64 resumed>"A`SU\37\2523\177=.\345.\264\363\223\31\336\5\"s\323\202\210\375\267j#\320#F'\331"..., 182477, 67116112) = 182477
[pid 371452] <... readahead resumed>)   = 0
[pid 371453] close(367 <unfinished ...>
[pid 371452] pread64(296,  <unfinished ...>
[pid 371453] <... close resumed>)       = 0
[pid 371452] <... pread64 resumed>"\1\323\211\246 R\352\226\241 \263\354\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 67732778) = 53
[pid 371453] pread64(307,  <unfinished ...>
[pid 371452] fstat(296, {st_mode=S_IFREG|0644, st_size=67732831, ...}) = 0
[pid 371452] ioctl(296, FS_IOC_GETVERSION, 0x7f38207f83c8) = 0
[pid 371452] pread64(296, "\0!\7filter.rocksdb.BuiltinBloomFi"..., 87, 67732691) = 87
[pid 371452] pread64(296, "\0$\4rocksdb.block.based.table.ind"..., 817, 67731874) = 817
[pid 371452] pread64(296, "\0\3p\300\315\0\307e\0\4p\300\347#\314e\366d\0\4p\301\6z\307\312\1\275g\0\3p"..., 79416, 67652458) = 79416
[pid 371452] pread64(296,  <unfinished ...>
[pid 371451] <... pread64 resumed>"\0000\363\3\0\0\0\0\0\0\0\0\204\237\276b\371\315\274\306\233\3354e\373\354:\357\314\2333\242"..., 16033, 31419416) = 16033
[pid 371451] futex(0x7f3824c09540, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 371450] <... futex resumed>)       = 0
[pid 371452] <... pread64 resumed>"\2126\261\223\7\321\305\266*\6l8\367I%\21\201\315(\16\220\340\222\241A\305 \225:b\4\246"..., 537481, 67114977) = 537481
[pid 371450] futex(0x7f3824c09540, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 371451] openat(AT_FDCWD, "/home/ubuntu/.near/data/3222880.sst", O_RDONLY|O_CLOEXEC) = 367
[pid 371451] fcntl(367, F_GETFD <unfinished ...>
[pid 371452] close(170 <unfinished ...>
[pid 371451] <... fcntl resumed>)       = 0x1 (flags FD_CLOEXEC)
[pid 371452] <... close resumed>)       = 0
[pid 371451] fcntl(367, F_SETFD, FD_CLOEXEC <unfinished ...>
[pid 371452] pread64(296,  <unfinished ...>
[pid 371451] <... fcntl resumed>)       = 0
[pid 371451] fadvise64(367, 0, 0, POSIX_FADV_RANDOM) = 0
[pid 371451] readahead(367, 67665430, 965 <unfinished ...>
[pid 371452] <... pread64 resumed>"\2126\261\223\7\321\305\266*\6l8\367I%\21\201\315(\16\220\340\222\241A\305 \225:b\4\246"..., 537481, 67114977) = 537481
[pid 371451] <... readahead resumed>)   = 0
[pid 371453] <... pread64 resumed>"\347\336\2\230\0000\371\20\202\234\351\325\271\353\30\311\240\2705\231\305\275WZ\272\365\266\312@H\371u"..., 36391, 4628635) = 36391
[pid 371451] pread64(367,  <unfinished ...>
[pid 371452] pread64(296,  <unfinished ...>
[pid 371451] <... pread64 resumed>"\1\317\202\242 R\317\233\236 \302\340\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 67666342) = 53
[pid 371451] fstat(367, {st_mode=S_IFREG|0644, st_size=67666395, ...}) = 0
[pid 371453] futex(0x7f3824c4119c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 371451] ioctl(367, FS_IOC_GETVERSION, 0x7f3821870fe8) = 0
[pid 371451] pread64(367, "\0!\7filter.rocksdb.BuiltinBloomFi"..., 87, 67666255) = 87
[pid 371451] pread64(367, "\0$\4rocksdb.block.based.table.ind"..., 825, 67665430) = 825
[pid 371451] close(328)                 = 0
[pid 371451] openat(AT_FDCWD, "/home/ubuntu/.near/data/3217771.sst", O_RDONLY|O_CLOEXEC) = 170
[pid 371451] fcntl(170, F_GETFD)        = 0x1 (flags FD_CLOEXEC)
[pid 371451] fcntl(170, F_SETFD, FD_CLOEXEC) = 0
[pid 371451] fadvise64(170, 0, 0, POSIX_FADV_RANDOM) = 0
[pid 371451] readahead(170, 67673586, 965) = 0
[pid 371451] pread64(170, "\1\253\302\242 R\233\327\236 \322\344\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 53, 67674498) = 53
[pid 371452] <... pread64 resumed>"\363\177\230\0(\332\1rW<\27-\217\262\277\257\313X\365\31\374\307\4*\206\4\301\375\376\240\243\4"..., 13312, 50099432) = 13312
[pid 371451] fstat(170, {st_mode=S_IFREG|0644, st_size=67674551, ...}) = 0
[pid 371451] ioctl(170, FS_IOC_GETVERSION <unfinished ...>
[pid 371452] futex(0x7f3824c4119c, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 371451] <... ioctl resumed>, 0x7f3821870fe8) = 0
[pid 371451] pread64(170,  <unfinished ...>
[pid 371452] <... futex resumed>)       = 1
[pid 371447] <... futex resumed>)       = 0
[pid 371451] <... pread64 resumed>"\0!\7filter.rocksdb.BuiltinBloomFi"..., 87, 67674411) = 87
[pid 371451] pread64(170,  <unfinished ...>
[pid 371447] futex(0x7f3824c4119c, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 371451] <... pread64 resumed>"\0$\4rocksdb.block.based.table.ind"..., 825, 67673586) = 825
[pid 371452] pread64(428,  <unfinished ...>
[pid 371451] close(295 <unfinished ...>
[pid 371447] <... futex resumed>)       = 1
[pid 371453] <... futex resumed>)       = 0
[pid 371451] <... close resumed>)       = 0
[pid 371447] futex(0x7f3824c4119c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 371451] openat(AT_FDCWD, "/home/ubuntu/.near/data/3139501.sst", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 371453] pread64(145,  <unfinished ...>

so seems openat is an expensive guy.

@olonho
Copy link
Contributor

olonho commented Dec 9, 2021

Interesting, when using Dwarf based stack unwinder profile looks way saner:

+   98.46%     0.00%  neard    neard                [.] rayon_core::join::join_context::{{closure}}             ◆
+   98.45%     0.09%  neard    neard                [.] state_viewer::apply_chain_range::apply_chain_range::{{cl▒
+   94.20%     0.08%  neard    neard                [.] <near_store::db::RocksDB as near_store::db::Database>::g▒
+   93.80%     0.08%  neard    neard                [.] rocksdb_get_pinned_cf                                   ▒
+   93.73%     0.01%  neard    neard                [.] rocksdb::DBImpl::Get                                    ▒
+   93.72%     0.11%  neard    neard                [.] rocksdb::DBImpl::GetImpl                                ▒
+   93.32%     0.19%  neard    neard                [.] rocksdb::Version::Get                                   ▒
+   92.72%     0.18%  neard    neard                [.] rocksdb::TableCache::Get                                ▒
+   84.44%     0.00%  neard    neard                [.] <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Jo▒
+   68.47%     0.31%  neard    neard                [.] rocksdb::BlockFetcher::ReadBlockContents                ▒
+   56.83%     0.03%  neard    neard                [.] rocksdb::FilterBlockReaderCommon<rocksdb::BlockContents>▒
+   56.79%     0.02%  neard    neard                [.] rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::BlockCo▒
-   56.77%     0.10%  neard    neard                [.] rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<r▒
   - 56.67% rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::BlockContents>                      ▒
      + 54.86% rocksdb::BlockFetcher::ReadBlockContents                                                         ▒
      + 1.44% rocksdb::BlockBasedTable::PutDataBlockToCache<rocksdb::BlockContents>                             ▒
+   54.78%     0.18%  neard    neard                [.] rocksdb::BlockBasedTable::Get                           ▒
+   50.36%     0.00%  neard    [kernel.kallsyms]    [k] entry_SYSCALL_64_after_hwframe                          ▒
+   50.24%     0.02%  neard    [kernel.kallsyms]    [k] do_syscall_64                                           ▒
+   48.27%     0.14%  neard    neard                [.] rocksdb::RandomAccessFileReader::Read                   ▒
+   48.10%     0.10%  neard    neard                [.] rocksdb::PosixRandomAccessFile::Read                    ▒
+   47.99%     0.00%  neard    libpthread-2.31.so   [.] __libc_pread (inlined)                                  ▒
+   47.17%     0.03%  neard    [kernel.kallsyms]    [k] __x64_sys_pread64                                       ▒
+   47.11%     0.01%  neard    [kernel.kallsyms]    [k] ksys_pread64                                            ▒
+   46.96%     0.03%  neard    [kernel.kallsyms]    [k] vfs_read                                                ▒
+   46.71%     0.08%  neard    [kernel.kallsyms]    [k] new_sync_read                                           ▒
+   46.64%     0.06%  neard    [kernel.kallsyms]    [k] ext4_file_read_iter                                     ▒
+   46.58%     0.02%  neard    [kernel.kallsyms]    [k] generic_file_read_iter                                  ▒
+   46.49%     2.50%  neard    [kernel.kallsyms]    [k] generic_file_buffered_read                              ▒
+   42.35%     0.14%  neard    neard                [.] rocksdb::BlockBasedFilterBlockReader::MayMatch          ▒
+   41.58%     0.05%  neard    neard                [.] rocksdb::FilterBlockReaderCommon<rocksdb::BlockContents>▒
+   37.43%     0.03%  neard    neard                [.] rocksdb::TableCache::FindTable                          ▒
+   35.51%     0.43%  neard    [kernel.kallsyms]    [k] copy_page_to_iter                                       

With dominators like

- 93.60% rocksdb::DBImpl::GetImpl                                                                                                  ▒
      - 93.32% rocksdb::Version::Get                                                                                                  ▒
         - 92.72% rocksdb::TableCache::Get                                                                                            ▒
            - 54.78% rocksdb::BlockBasedTable::Get                                                                                    ▒
               - 42.35% rocksdb::BlockBasedFilterBlockReader::MayMatch                                                                ▒
                  - 41.58% rocksdb::FilterBlockReaderCommon<rocksdb::BlockContents>::GetOrReadFilterBlock                             ▒
                     - 41.52% rocksdb::FilterBlockReaderCommon<rocksdb::BlockContents>::ReadFilterBlock                               ▒
                        - rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::BlockContents>                                             ▒
                           - rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::BlockContents>                           ▒
                              - 40.81% rocksdb::BlockFetcher::ReadBlockContents                                                       ▒
                                 - 28.43% rocksdb::RandomAccessFileReader::Read                                                       ▒
                                    - 28.41% rocksdb::PosixRandomAccessFile::Read                                                     ▒
                                       - __libc_pread (inlined)                                                                       ▒
                                          - 28.28% entry_SYSCALL_64_after_hwframe                                                     ▒
                                             - do_syscall_64                                                                          ▒
                                                - 28.25% __x64_sys_pread64                                                            ▒
                                                   - 28.24% ksys_pread64                                                              ▒
                                                      - 28.17% vfs_read                                                               ▒
                                                         - 28.12% new_sync_read                                                       ▒
                                                            - ext4_file_read_iter                                                     ▒
                                                               - generic_file_read_iter                                               ▒
                                                                  - 28.05% generic_file_buffered_read                                 ▒
                                                                     - 23.28% copy_page_to_iter                                       ▒
                                                                          22.81% copy_user_enhanced_fast_string                       ▒
                                                                     - 2.46% generic_file_buffered_read_get_pages                     ▒
                                                                          2.32% find_get_pages_contig                                 ▒
                                 - 11.23% rocksdb::VerifyBlockChecksum                                                                ▒
                                      11.21% rocksdb::crc32c::crc32c_3way                                                             ▒
                                 - 1.06% operator new                                                                                 ▒
                                      __libc_malloc (inlined)                                                                         ▒
               - 9.03% rocksdb::BlockBasedTable::NewDataBlockIterator<rocksdb::DataBlockIter>                                         ▒
                  - 8.97% rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::Block>                                                     ▒
                     - rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>                                         ▒
                        - 8.40% rocksdb::BlockFetcher::ReadBlockContents                                                              ▒
                           - 6.04% rocksdb::RandomAccessFileReader::Read                                                              ▒
                              - 5.97% rocksdb::PosixRandomAccessFile::Read                                                            ▒
                                 - 5.90% __libc_pread (inlined)                                                                       ▒
                                    - 5.70% entry_SYSCALL_64_after_hwframe                                                            ▒
                                       - 5.69% do_syscall_64                                                                          ▒
                                          - 5.66% __x64_sys_pread64                                                                   ▒
                                             - 5.64% ksys_pread64                                                                     ▒
                                                - 5.60% vfs_read                                                                      ▒
                                                   - 5.54% new_sync_read                                                              ▒
                                                      - ext4_file_read_iter                                                           ▒
                                                         - generic_file_read_iter                                                     ▒
                                                            - 5.49% generic_file_buffered_read                                        ▒
                                                               - 3.75% generic_file_buffered_read_get_pages                           ▒
                                                                  - 3.11% page_cache_sync_ra                                          ▒
                                                                     - force_page_cache_ra                                            ▒
                                                                        - do_page_cache_ra                                            ▒
                                                                           - 3.04% page_cache_ra_unbounded                            ▒
                                                                              - 2.06% read_pages                                      ▒
                                                                                 - 1.24% blk_finish_plug                              ▒
                                                                                    - 1.23% blk_flush_plug_list                       ▒
                                                                                       - blk_mq_flush_plug_list                       ▒
                                                                                          - blk_mq_sched_insert_requests               

@olonho
Copy link
Contributor

olonho commented Dec 9, 2021

copy_user_enhanced_fast_string is reached via

            - state_viewer::apply_chain_range::apply_chain_range::{{closure}}                                                         ▒
               - 12.70% <near_chain::store::ChainStore as near_chain::store::ChainStoreAccess>::get_block_hash_by_height              ▒
                    near_store::Store::get_ser                                                                                        ▒
                    <near_store::db::RocksDB as near_store::db::Database>::get                                                        ▒
                    rocksdb_get_pinned_cf                                                                                             ▒
                    rocksdb::DBImpl::Get                                                                                              ▒
                    rocksdb::DBImpl::GetImpl                                                                                          ▒
                    rocksdb::Version::Get                                                                                             ▒
                  - rocksdb::TableCache::Get                                                                                          ▒
                     - 12.24% rocksdb::BlockBasedTable::Get                                                                           ▒
                        - 12.05% rocksdb::BlockBasedFilterBlockReader::MayMatch                                                       ▒
                             rocksdb::FilterBlockReaderCommon<rocksdb::BlockContents>::GetOrReadFilterBlock                           ▒
                             rocksdb::FilterBlockReaderCommon<rocksdb::BlockContents>::ReadFilterBlock                                ▒
                             rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::BlockContents>                                          ▒
                             rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::BlockContents>                           ▒
                             rocksdb::BlockFetcher::ReadBlockContents                                                                 ▒
                             rocksdb::RandomAccessFileReader::Read                                                                    ▒
                             rocksdb::PosixRandomAccessFile::Read                                                                     ▒
                             __libc_pread (inlined)                                                                                   ▒
                             entry_SYSCALL_64_after_hwframe                                                                           ▒
                             do_syscall_64                                                                                            ▒
                             __x64_sys_pread64                                                                                        ▒
                             ksys_pread64                                                                                             ▒
                             vfs_read                                                                                                 ▒
                             new_sync_read                                                                                            ▒
                             ext4_file_read_iter                                                                                      ▒
                             generic_file_read_iter                                                                                   ▒
                             generic_file_buffered_read                                                                               ▒
                             copy_page_to_iter                                                                                        ▒
                             copy_user_enhanced_fast_string                   

@bowenwang1996
Copy link
Collaborator Author

@olonho so basically it is within rocksdb?

@olonho
Copy link
Contributor

olonho commented Dec 20, 2021

Yes, seems so

@olonho
Copy link
Contributor

olonho commented Dec 23, 2021

Further investigations show pretty mediocre IO performance on the system.
Namely iotop tool shows the following:

Total DISK READ:        11.47 M/s | Total DISK WRITE:         0.00 B/s
Current DISK READ:      11.47 M/s | Current DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                               
 378500 be/4 ubuntu      2.31 M/s    0.00 B/s  0.00 % 10.75 % ./target/release/neard --home /home/~_range --shard-id 0 --only-contracts
 378501 be/4 ubuntu      2.02 M/s    0.00 B/s  0.00 %  8.78 % ./target/release/neard --home /home/~_range --shard-id 0 --only-contracts
 378503 be/4 ubuntu   1897.52 K/s    0.00 B/s  0.00 %  8.73 % ./target/release/neard --home /home/~_range --shard-id 0 --only-contracts
 378505 be/4 ubuntu   1670.11 K/s    0.00 B/s  0.00 %  7.43 % ./target/release/neard --home /home/~_range --shard-id 0 --only-contracts
 378504 be/4 ubuntu   1267.50 K/s    0.00 B/s  0.00 %  5.61 % ./target/release/neard --home /home/~_range --shard-id 0 --only-contracts
 378502 be/4 ubuntu   1356.97 K/s    0.00 B/s  0.00 %  5.25 % ./target/release/neard --home /home/~_range --shard-id 0 --only-contracts
 378499 be/4 ubuntu    935.71 K/s    0.00 B/s  0.00 %  3.47 % ./target/release/neard --home /home/~_range --shard-id 0 --only-contracts
 378506 be/4 ubuntu    182.67 K/s    0.00 B/s  0.00 %  0.58 % ./target/release/neard --home /home/~_range --shard-id 0 --only-contracts
      1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % systemd --system --deserialize 27
      2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]

so we generally get 12M/sec disk drive performance. Compared to https://www.anandtech.com/show/7173/samsung-ssd-840-evo-review-120gb-250gb-500gb-750gb-1tb-models-tested/8 it hints that we get performance 3x worse than slowest of SSD drives in this list. Just for reference compared that to what Samsung T5 external SSD in my box can give and get results around 150M/sec (queue depth 64, using AmorphousDiskMark), or 15-20M/sec with queue depth 1. Queue depth of 8 shown 118M/sec. Internal SSD shows beefy 331M/sec.

@olonho
Copy link
Contributor

olonho commented Dec 23, 2021

https://cloud.google.com/compute/docs/disks/performance suggests that for GCP instance expected read throughput per instance is between 200MByte/sec and 9GByte/sec.

@olonho
Copy link
Contributor

olonho commented Dec 24, 2021

And some nodes (archive backup) shows actual peak read performance around 500Mb/sec per gcloud console.

@bowenwang1996
Copy link
Collaborator Author

Interesting. @olonho do you think the issue is specific to the instance you use to run the history playback tool? Is there some disk IO benchmark we can run to see whether the instance itself is problematic?

@janewang
Copy link

janewang commented Feb 9, 2022

Moving to @marcelo-gonzalez for the OKR ownership transfer

@bowenwang1996
Copy link
Collaborator Author

Moving to @marcelo-gonzalez for the OKR ownership transfer

I don't think this is in the OKR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. Groomed Node Node team P-low Priority: low T-node Team: issues relevant to the node experience team
Projects
None yet
Development

No branches or pull requests

6 participants