[FEA] Improve url_decode performance further #8030

chenrui17 · 2021-04-22T11:55:22Z

Is your feature request related to a problem? Please describe.
Follow-up to #7571: url_decode performance is still not ideal. GPU performance is roughly on par with CPU performance even though GPU utilization is very high, so there is still a lot of room to optimize url_decode.

Describe the solution you'd like
Here is the query trace, which contains url_decode.

Describe alternatives you've considered
None

Additional context
None

jlowe · 2021-04-22T14:37:44Z

@chenrui17 it would be good to get a better sense of the input that is triggering the poor behavior.

What is the average byte length of each string?
How many URL-encoding escape sequences occur per string on average?

Hopefully the URL decode benchmarks can be updated accordingly to reproduce the behavior shown in the trace, and then the code can be optimized against that benchmark.
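The two statistics asked for above can be gathered with a short script. This is only a minimal sketch: the `url_stats` helper and the sample URLs are hypothetical, and it assumes the URLs are available as Python strings and that an escape sequence is `%` followed by two hexadecimal digits.

```python
import re
from statistics import mean

# Assumption: a URL-encoding escape sequence is '%' followed by two hex digits.
ESCAPE = re.compile(rb"%[0-9A-Fa-f]{2}")

def url_stats(urls):
    """Return (average byte length, average escape sequences per string)."""
    encoded = [u.encode("utf-8") for u in urls]
    avg_len = mean(len(b) for b in encoded)
    avg_escapes = mean(len(ESCAPE.findall(b)) for b in encoded)
    return avg_len, avg_escapes

# Hypothetical sample; replace with the url column from the real data.
sample = ["https://example.com/a%20b", "https://example.com/q?x=%E4%BD%A0"]
avg_len, avg_escapes = url_stats(sample)
print(f"average byte length: {avg_len}, average escapes per string: {avg_escapes}")
```

Running the same function over a sample of the production column should give numbers that can be plugged directly into the benchmark parameters.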

chenrui17 · 2021-04-25T02:56:02Z

> @chenrui17 it would be good to get a better sense of the input that is triggering the poor behavior.
>
> What is the average byte length of each string?
>
> How many URL-encoding escape sequences occur per string on average?
>
> Hopefully the URL decode benchmarks can be updated accordingly to reproduce the behavior shown in the trace, and then the code can be optimized against that benchmark.

The average length of each string is 800 (max length is 4686), and about 200 escape sequences occur per string on average (max is about 1500).

In addition, I inserted some timing code into the url_decode function, but it took only 2 ms even though my input string length is 1000+ (the input was a single string). So I guess the problem may be that the url field of the input column has too many rows. Is that right? If so, how can it be solved? I am also ready to set BUILD_BENCHMARKS=ON and reproduce it at the libcudf level.
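To reproduce this input shape outside the production data, a synthetic string with the same statistics could be generated. The sketch below is an assumption-laden illustration, not the libcudf benchmark generator: `make_encoded_url` is a hypothetical helper, the defaults (800 bytes, 200 escape sequences) come from the averages quoted above, and every escape encodes a printable ASCII byte so that `urllib.parse.unquote` can serve as a CPU reference.

```python
import random
from urllib.parse import unquote

def make_encoded_url(total_bytes=800, num_escapes=200, seed=0):
    """Build one ASCII string of total_bytes bytes with num_escapes '%XX' sequences."""
    if 3 * num_escapes > total_bytes:
        raise ValueError("escape sequences do not fit in total_bytes")
    rng = random.Random(seed)
    # Each token is either a 3-byte escape (printable ASCII) or one literal byte.
    tokens = ["%{:02X}".format(rng.randrange(0x20, 0x7F)) for _ in range(num_escapes)]
    tokens += [rng.choice("abcdefghijklmnopqrstuvwxyz")
               for _ in range(total_bytes - 3 * num_escapes)]
    rng.shuffle(tokens)
    return "".join(tokens)

url = make_encoded_url()
decoded = unquote(url)  # CPU reference decode, single pass
print(len(url), url.count("%"), len(decoded))
```

Each escape shrinks from three bytes to one when decoded, so a column of such strings exercises both the long-string and the escape-heavy paths at once.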

chenrui17 · 2021-04-25T09:35:19Z

@jlowe Attached is the url_decode benchmark result. I guess the problem is that the row count of my input Parquet file is too large (about 1,800,000), and there are about 20,000 files.

Running ./cpp/build/gbenchmarks/STRINGS_BENCH
Run on (80 X 2401 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x40)
L1 Instruction 32 KiB (x40)
L2 Unified 1024 KiB (x40)
L3 Unified 28160 KiB (x2)
Load Average: 2.70, 2.82, 6.88
WARNING CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.

-------------------------------------------------------------------------------------------------------------------
Benchmark                                                         Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------
UrlDecode<10>/url_decode_10pct/100/10/manual_time             0.148 ms        0.175 ms         4184 bytes_per_second=9.03193M/s
UrlDecode<10>/url_decode_10pct/1000/10/manual_time            0.144 ms        0.170 ms         4699 bytes_per_second=92.7114M/s
UrlDecode<10>/url_decode_10pct/10000/10/manual_time           0.162 ms        0.187 ms         3798 bytes_per_second=825.938M/s
UrlDecode<10>/url_decode_10pct/100000/10/manual_time          0.206 ms        0.230 ms         3166 bytes_per_second=6.32715G/s
UrlDecode<10>/url_decode_10pct/1000000/10/manual_time         0.779 ms        0.804 ms          898 bytes_per_second=16.7397G/s
UrlDecode<10>/url_decode_10pct/100/100/manual_time            0.146 ms        0.173 ms         3624 bytes_per_second=67.8533M/s
UrlDecode<10>/url_decode_10pct/1000/100/manual_time           0.155 ms        0.180 ms         3579 bytes_per_second=639.935M/s
UrlDecode<10>/url_decode_10pct/10000/100/manual_time          0.211 ms        0.236 ms         3190 bytes_per_second=4.58927G/s
UrlDecode<10>/url_decode_10pct/100000/100/manual_time         0.762 ms        0.787 ms          914 bytes_per_second=12.7101G/s
UrlDecode<10>/url_decode_10pct/1000000/100/manual_time         6.93 ms         6.96 ms          101 bytes_per_second=13.9688G/s
UrlDecode<10>/url_decode_10pct/100/1000/manual_time           0.152 ms        0.177 ms         3572 bytes_per_second=631.421M/s
UrlDecode<10>/url_decode_10pct/1000/1000/manual_time          0.208 ms        0.233 ms         3295 bytes_per_second=4.49671G/s
UrlDecode<10>/url_decode_10pct/10000/1000/manual_time         0.752 ms        0.777 ms          930 bytes_per_second=12.438G/s
UrlDecode<10>/url_decode_10pct/100000/1000/manual_time         6.83 ms         6.86 ms          102 bytes_per_second=13.6918G/s
UrlDecode<10>/url_decode_10pct/1000000/1000/manual_time        73.2 ms         73.3 ms            9 bytes_per_second=12.7679G/s
UrlDecode<10>/url_decode_10pct/100/10000/manual_time          0.191 ms        0.215 ms         3306 bytes_per_second=4.88966G/s
UrlDecode<10>/url_decode_10pct/1000/10000/manual_time         0.741 ms        0.766 ms          945 bytes_per_second=12.5748G/s
UrlDecode<10>/url_decode_10pct/10000/10000/manual_time         6.72 ms         6.75 ms          104 bytes_per_second=13.8707G/s
UrlDecode<10>/url_decode_10pct/100000/10000/manual_time        71.9 ms         72.0 ms            9 bytes_per_second=12.9505G/s
terminate called after throwing an instance of 'rmm::bad_alloc'
  what():  std::bad_alloc: RMM failure at:/ssd1/chenrui/cudf-origin/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:188: Maximum pool size exceeded
Aborted

github-actions · 2021-05-25T13:17:46Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

jlowe · 2021-05-25T13:27:38Z

Still relevant

This PR is intended to optimize the URL decoding performance, especially on large URLs. Additionally, a test case for large URLs has been added.

When tested on V100, baseline performance at 7521c3f

```
------------------------------------------------------------------------------------------------------------------
Benchmark                                                        Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------
UrlDecode<10>/url_decode_10pct/100000000/10/manual_time        111 ms          111 ms            6 bytes_per_second=11.7959G/s
UrlDecode<10>/url_decode_10pct/10000000/100/manual_time        107 ms          107 ms            7 bytes_per_second=9.0136G/s
UrlDecode<10>/url_decode_10pct/1000000/1000/manual_time        107 ms          107 ms            7 bytes_per_second=8.76755G/s
UrlDecode<50>/url_decode_50pct/100000000/10/manual_time        129 ms          129 ms            5 bytes_per_second=10.144G/s
UrlDecode<50>/url_decode_50pct/10000000/100/manual_time        126 ms          126 ms            6 bytes_per_second=7.70821G/s
UrlDecode<50>/url_decode_50pct/1000000/1000/manual_time        122 ms          122 ms            6 bytes_per_second=7.66783G/s
```

This PR

```
------------------------------------------------------------------------------------------------------------------
Benchmark                                                        Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------
UrlDecode<10>/url_decode_10pct/100000000/10/manual_time       97.5 ms         97.6 ms            7 bytes_per_second=13.3669G/s
UrlDecode<10>/url_decode_10pct/10000000/100/manual_time       28.8 ms         28.8 ms           24 bytes_per_second=33.6024G/s
UrlDecode<10>/url_decode_10pct/1000000/1000/manual_time       21.8 ms         21.8 ms           32 bytes_per_second=42.9686G/s
UrlDecode<50>/url_decode_50pct/100000000/10/manual_time        109 ms          109 ms            6 bytes_per_second=11.9786G/s
UrlDecode<50>/url_decode_50pct/10000000/100/manual_time       30.2 ms         30.3 ms           23 bytes_per_second=32.0311G/s
UrlDecode<50>/url_decode_50pct/1000000/1000/manual_time       22.7 ms         22.8 ms           31 bytes_per_second=41.1086G/s
```

close #8030

Authors:
- https://github.com/gaohao95

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- David Wendt (https://github.com/davidwendt)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #8622
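For a quick read of the before/after figures, the per-case speedup can be computed directly from the reported times. This is just arithmetic on the numbers quoted in the PR description; the dictionary keys are shortened benchmark names chosen here for readability.

```python
# (baseline ms, PR ms) copied from the "Time" column of the two runs quoted above.
timings = {
    "10pct/100000000/10": (111, 97.5),
    "10pct/10000000/100": (107, 28.8),
    "10pct/1000000/1000": (107, 21.8),
    "50pct/100000000/10": (129, 109),
    "50pct/10000000/100": (126, 30.2),
    "50pct/1000000/1000": (122, 22.7),
}

# Speedup is baseline time divided by the time measured with the PR applied.
speedups = {name: base / new for name, (base, new) in timings.items()}

for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
```

The cases whose last argument is 100 or 1000 improve by roughly 3.7x to 5.4x, while those whose last argument is 10 improve by only about 1.1x to 1.2x. If that last argument is the per-string length, this is consistent with the PR's stated focus on large URLs.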

chenrui17 added the Needs Triage and feature request labels Apr 22, 2021

jlowe added the Performance, libcudf, and strings labels and removed the Needs Triage label Apr 22, 2021

chenrui17 changed the title ~~[FEA]Improve url_decode performance further~~ [FEA] Improve url_decode performance further Apr 23, 2021

github-actions bot added the inactive-30d label May 25, 2021

jrhemstad added the 0 - Backlog label and removed the inactive-30d label May 25, 2021

gaohao95 mentioned this issue Jun 29, 2021

Optimize URL Decoding #8622

Merged

rapids-bot bot closed this as completed in #8622 Aug 30, 2021
