When testing Velox's spilling functionality, we found a performance issue: although spilling uses a thread pool of up to 8 threads, the actual CPU concurrency is only 1~2.
We then used tools such as pstack to find out what was wrong, and it turned out that all spilling threads are bottlenecked on the updateHistogramMetric path into prometheus-cpp. The stack is as follows:
Thread 2242 (Thread 0x7f8aee23d640 (LWP 10173)):
#0 0x00007fa72a2a6dc0 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fa72a29f753 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2 0x00007fa60f5c23e6 in std::__1::mutex::lock() () from /data/shared_libs/libnative_execution.so
#3 0x00007fa60fa31358 in prometheus::Summary::Observe(double) () from /data/shared_libs/libnative_execution.so
#4 0x00007fa60f9fc768 in facebook::presto::prometheus::PrometheusStatsReporter::addHistogramMetricValue(char const*, unsigned long) const () from /data/shared_libs/libnative_execution.so
#5 0x00007fa60f9fc91f in facebook::presto::prometheus::PrometheusStatsReporter::addHistogramMetricValue(folly::Range<char const*>, unsigned long) const () from /data/shared_libs/libnative_execution.so
#6 0x00007fa614234bca in facebook::velox::common::updateGlobalSpillAppendStats(unsigned long, unsigned long) () from /data/shared_libs/libnative_execution.so
#7 0x00007fa613c32177 in facebook::velox::exec::SpillWriter::updateAppendStats(unsigned long, unsigned long) () from /data/shared_libs/libnative_execution.so
#8 0x00007fa613c32083 in facebook::velox::exec::SpillWriter::write(std::__1::shared_ptr<facebook::velox::RowVector> const&, folly::Range<facebook::velox::IndexRange*> const&) () from /data/shared_libs/libnative_execution.so
#9 0x00007fa613c2c119 in facebook::velox::exec::SpillState::appendToPartition(unsigned int, std::__1::shared_ptr<facebook::velox::RowVector> const&) () from /data/shared_libs/libnative_execution.so
#10 0x00007fa613c3662e in facebook::velox::exec::Spiller::writeSpill(int) () from /data/shared_libs/libnative_execution.so
#11 0x00007fa613c3a494 in facebook::velox::AsyncSource<facebook::velox::exec::Spiller::SpillStatus>::runMake(std::__1::function<std::__1::unique_ptr<facebook::velox::exec::Spiller::SpillStatus, std::__1::default_delete<facebook::velox::exec::Spiller::SpillStatus> > ()>&) () from /data/shared_libs/libnative_execution.so
#12 0x00007fa613c39d66 in facebook::velox::AsyncSource<facebook::velox::exec::Spiller::SpillStatus>::move() () from /data/shared_libs/libnative_execution.so
#13 0x00007fa613c36bb2 in facebook::velox::exec::Spiller::runSpill(bool) () from /data/shared_libs/libnative_execution.so
#14 0x00007fa613c372eb in facebook::velox::exec::Spiller::spill(facebook::velox::exec::RowContainerIterator const*) () from /data/shared_libs/libnative_execution.so
#15 0x00007fa613d15a20 in facebook::velox::exec::GroupingSet::spill() () from /data/shared_libs/libnative_execution.so
#16 0x00007fa613d1e1a6 in facebook::velox::exec::HashAggregation::reclaim(unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&) () from /data/shared_libs/libnative_execution.so
#17 0x00007fa613c03534 in long std::__1::__function::__policy_invoker<long ()>::__call_impl[abi:ue170006]<std::__1::__function::__default_alloc_func<facebook::velox::exec::Operator::MemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::$_0, long ()> >(std::__1::__function::__policy_storage const*) () from /data/shared_libs/libnative_execution.so
#18 0x00007fa6142aa382 in facebook::velox::memory::MemoryReclaimer::run(std::__1::function<long ()> const&, facebook::velox::memory::MemoryReclaimer::Stats&) () from /data/shared_libs/libnative_execution.so
#19 0x00007fa613c03104 in facebook::velox::exec::Operator::MemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&) () from /data/shared_libs/libnative_execution.so
#20 0x00007fa613bfec58 in std::__1::unique_ptr<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult, std::__1::default_delete<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult> > std::__1::__function::__policy_invoker<std::__1::unique_ptr<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult, std::__1::default_delete<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult> > ()>::__call_impl[abi:ue170006]<std::__1::__function::__default_alloc_func<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::$_1, std::__1::unique_ptr<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult, std::__1::default_delete<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult> > ()> >(std::__1::__function::__policy_storage const*) () from /data/shared_libs/libnative_execution.so
#21 0x00007fa613bfea24 in facebook::velox::AsyncSource<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult>::runMake(std::__1::function<std::__1::unique_ptr<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult, std::__1::default_delete<facebook::velox::exec::ParallelMemoryReclaimer::reclaim(facebook::velox::memory::MemoryPool*, unsigned long, unsigned long, facebook::velox::memory::MemoryReclaimer::Stats&)::ReclaimResult> > ()>&) () from /data/shared_libs/libnative_execution.so
Turning off prometheus-cpp metric collection would certainly solve the problem, but we want to understand why it happens.
After going through the code, we found that the Spiller code calls updateHistogramMetric too frequently.
In Spiller::nextBatch, the code calls appendToPartition() every 64 rows or every 256KB (see velox/velox/exec/Spiller.cpp, lines 459 to 477 at c13b8ed).
appendToPartition() ultimately updates the histogram metric through the call stack above, which causes the performance issue.
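To illustrate the mechanism, here is a minimal standalone sketch (not Velox code; it assumes only prometheus-cpp's public Summary API): Summary::Observe() guards its quantile sketch with an internal mutex, so when 8 threads observe into one shared summary on every small append, they serialize on that lock, which matches the 1~2 effective CPU concurrency we saw.

```cpp
// Minimal reproducer sketch for the contention pattern, assuming
// prometheus-cpp's public Summary API. Not Velox code.
#include <prometheus/summary.h>

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  // Every Observe() on this shared summary takes the same internal mutex.
  prometheus::Summary summary{
      prometheus::Summary::Quantiles{{0.5, 0.05}, {0.99, 0.001}}};

  constexpr int kThreads = 8;        // mirrors the 8-thread spill pool
  constexpr int kObservations = 100'000;  // per-thread metric updates

  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  for (int t = 0; t < kThreads; ++t) {
    threads.emplace_back([&summary] {
      for (int i = 0; i < kObservations; ++i) {
        summary.Observe(1.0);  // all threads contend on one lock here
      }
    });
  }
  for (auto& th : threads) {
    th.join();
  }
  auto elapsed = std::chrono::duration<double>(
                     std::chrono::steady_clock::now() - start)
                     .count();
  std::printf("%.2f M observations/s\n",
              kThreads * kObservations / elapsed / 1e6);
  return 0;
}
```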
We also noticed that Prestissimo uses prometheus-cpp as its metric exporter, so we wonder whether Prestissimo has run into the same problem.
Besides, to solve the problem, we would like to know whether Velox could use a larger batch size in the Spiller code, or reduce the use of updateHistogramMetric, especially in critical sections (code paths that update metrics frequently); a sketch of the latter follows below.
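As a concrete but hypothetical sketch of the second option: the LocalSpillStats type, its member names, and the flush points below are our invention, not existing Velox API; only the two-argument updateGlobalSpillAppendStats signature is taken from the stack above (its parameter names are our guess, and it is stubbed here so the sketch runs standalone). The idea is to accumulate per-append counters locally and publish them to the global histogram once per flush.

```cpp
#include <cstdint>
#include <cstdio>

// In Velox this would be the real function from the stack above
// (facebook::velox::common::updateGlobalSpillAppendStats, which ends in
// prometheus Summary::Observe). Stubbed here; parameter names are a guess.
void updateGlobalSpillAppendStats(uint64_t spilledRows, uint64_t spilledBytes) {
  std::printf("publish: rows=%llu bytes=%llu\n",
              static_cast<unsigned long long>(spilledRows),
              static_cast<unsigned long long>(spilledBytes));
}

// Hypothetical local accumulator: the hot append path touches only plain
// members, and the histogram is updated once per flush instead of once per
// 64-row / 256KB append.
class LocalSpillStats {
 public:
  // Hot path: no locks, no metric library involved.
  void recordAppend(uint64_t rows, uint64_t bytes) {
    rows_ += rows;
    bytes_ += bytes;
  }

  // Cold path: one histogram update replaces thousands of per-append ones.
  void publish() {
    updateGlobalSpillAppendStats(rows_, bytes_);
    rows_ = 0;
    bytes_ = 0;
  }

 private:
  uint64_t rows_{0};
  uint64_t bytes_{0};
};

int main() {
  LocalSpillStats stats;
  for (int i = 0; i < 1000; ++i) {
    stats.recordAppend(64, 256 * 1024);  // the per-append pattern from above
  }
  stats.publish();  // single metric update at flush time
  return 0;
}
```

Note that this changes the histogram's sample distribution (one large observation per flush instead of many small ones), which should be acceptable for throughput-style stats but is worth confirming with the community.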
I could submit the related patches if the Velox community agrees with the modification.
Thanks a lot.