[FEA] Improve url_decode performance further #8030

Closed
chenrui17 opened this issue Apr 22, 2021 · 5 comments · Fixed by #8622
Labels
  • 0 - Backlog: In queue waiting for assignment
  • feature request: New feature or request
  • libcudf: Affects libcudf (C++/CUDA) code.
  • Performance: Performance related issue
  • strings: strings issues (C++ and Python)

Comments

@chenrui17
Contributor

chenrui17 commented Apr 22, 2021

Is your feature request related to a problem? Please describe.
This is a follow-up to #7571. url_decode performance is still not ideal: GPU performance is basically on par with CPU performance while GPU utilization is very high, so there is still a lot of room for url_decode optimization.

Describe the solution you'd like
Here is the query trace, which contains url_decode.
[image: query trace]

Describe alternatives you've considered
None

Additional context
None

@chenrui17 chenrui17 added Needs Triage Need team to review and classify feature request New feature or request labels Apr 22, 2021
@jlowe jlowe added Performance Performance related issue libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) and removed Needs Triage Need team to review and classify labels Apr 22, 2021
@jlowe
Member

jlowe commented Apr 22, 2021

@chenrui17 it would be good to get a better sense of the input that is triggering the poor behavior.

  • What is the average byte length of each string?
  • How many URL-encoding escape sequences occur per string on average?

Hopefully the URL decode benchmarks can be updated accordingly to reproduce the behavior shown in the trace, and then the code can be optimized against that benchmark.
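
Both quantities asked about above can be measured on a sample of the input column before touching the benchmark. A minimal sketch in plain Python, assuming the URLs are available as a list of `str`; the helper name `escape_stats` and the sample data are hypothetical, and only well-formed `%XX` sequences are counted:

```python
import re
from statistics import mean

# A well-formed URL escape sequence: '%' followed by two hex digits.
ESCAPE_RE = re.compile(r"%[0-9A-Fa-f]{2}")

def escape_stats(urls):
    """Return (average byte length, average escape sequences per string)."""
    byte_lengths = [len(u.encode("utf-8")) for u in urls]
    escape_counts = [len(ESCAPE_RE.findall(u)) for u in urls]
    return mean(byte_lengths), mean(escape_counts)

# Hypothetical sample; replace with rows drawn from the real column.
sample = ["https://example.com/a%20b", "https://example.com/%E4%BD%A0"]
avg_len, avg_escapes = escape_stats(sample)
```

Reporting the maximum alongside the mean is also useful, since a few very long strings can dominate the run time.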

@chenrui17 chenrui17 changed the title [FEA]Improve url_decode performance further [FEA] Improve url_decode performance further Apr 23, 2021
@chenrui17
Contributor Author

chenrui17 commented Apr 25, 2021

> @chenrui17 it would be good to get a better sense of the input that is triggering the poor behavior.
>
>   • What is the average byte length of each string?
>   • How many URL-encoding escape sequences occur per string on average?
>
> Hopefully the URL decode benchmarks can be updated accordingly to reproduce the behavior shown in the trace, and then the code can be optimized against that benchmark.

The average length of each string is 800 (max length is 4686), and about 200 escape sequences occur per string on average (max is about 1500).

In addition, I inserted some timing code into the url_decode function, but it took only 2 ms even though my input string length was 1000+ (the input was a single string). So I guess the problem may be that the url field of the input column has too many rows, right? If so, how can it be solved? I am also ready to set BUILD_BENCHMARKS=ON and reproduce it at the libcudf level.
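
To reproduce the reported shape in a benchmark, a synthetic input with roughly the same statistics can be generated. A sketch, assuming each escape sequence encodes one printable ASCII byte and that escapes are spread uniformly at random; both are assumptions, the real data may differ, and `make_encoded_url` is a hypothetical helper:

```python
import random
import string
from urllib.parse import unquote

def make_encoded_url(total_len=800, num_escapes=200, seed=0):
    """Build a string of total_len bytes containing num_escapes '%XX' sequences."""
    rng = random.Random(seed)
    num_plain = total_len - 3 * num_escapes  # each escape occupies 3 bytes
    if num_plain < 0:
        raise ValueError("total_len too small for num_escapes")
    # A token is either one unescaped letter or a 3-byte escape of a printable ASCII byte.
    tokens = [rng.choice(string.ascii_lowercase) for _ in range(num_plain)]
    tokens += ["%{:02X}".format(rng.randrange(0x20, 0x7F)) for _ in range(num_escapes)]
    rng.shuffle(tokens)
    return "".join(tokens)

url = make_encoded_url()   # 800 bytes, 200 escape sequences
decoded = unquote(url)     # CPU reference result to compare against
```

Repeating such a string across many rows gives a column whose per-string statistics match the averages reported above, and `unquote` provides a reference output for checking correctness.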

@chenrui17
Contributor Author

chenrui17 commented Apr 25, 2021

@jlowe Attached is the url_decode benchmark result. I guess the problem is that the row count of my input parquet file is too large, about 1,800,000, and there are about 20,000 files.

Running ./cpp/build/gbenchmarks/STRINGS_BENCH
Run on (80 X 2401 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x40)
L1 Instruction 32 KiB (x40)
L2 Unified 1024 KiB (x40)
L3 Unified 28160 KiB (x2)
Load Average: 2.70, 2.82, 6.88
WARNING CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.

-------------------------------------------------------------------------------------------------------------------
Benchmark                                                         Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------
UrlDecode<10>/url_decode_10pct/100/10/manual_time             0.148 ms        0.175 ms         4184 bytes_per_second=9.03193M/s
UrlDecode<10>/url_decode_10pct/1000/10/manual_time            0.144 ms        0.170 ms         4699 bytes_per_second=92.7114M/s
UrlDecode<10>/url_decode_10pct/10000/10/manual_time           0.162 ms        0.187 ms         3798 bytes_per_second=825.938M/s
UrlDecode<10>/url_decode_10pct/100000/10/manual_time          0.206 ms        0.230 ms         3166 bytes_per_second=6.32715G/s
UrlDecode<10>/url_decode_10pct/1000000/10/manual_time         0.779 ms        0.804 ms          898 bytes_per_second=16.7397G/s
UrlDecode<10>/url_decode_10pct/100/100/manual_time            0.146 ms        0.173 ms         3624 bytes_per_second=67.8533M/s
UrlDecode<10>/url_decode_10pct/1000/100/manual_time           0.155 ms        0.180 ms         3579 bytes_per_second=639.935M/s
UrlDecode<10>/url_decode_10pct/10000/100/manual_time          0.211 ms        0.236 ms         3190 bytes_per_second=4.58927G/s
UrlDecode<10>/url_decode_10pct/100000/100/manual_time         0.762 ms        0.787 ms          914 bytes_per_second=12.7101G/s
UrlDecode<10>/url_decode_10pct/1000000/100/manual_time         6.93 ms         6.96 ms          101 bytes_per_second=13.9688G/s
UrlDecode<10>/url_decode_10pct/100/1000/manual_time           0.152 ms        0.177 ms         3572 bytes_per_second=631.421M/s
UrlDecode<10>/url_decode_10pct/1000/1000/manual_time          0.208 ms        0.233 ms         3295 bytes_per_second=4.49671G/s
UrlDecode<10>/url_decode_10pct/10000/1000/manual_time         0.752 ms        0.777 ms          930 bytes_per_second=12.438G/s
UrlDecode<10>/url_decode_10pct/100000/1000/manual_time         6.83 ms         6.86 ms          102 bytes_per_second=13.6918G/s
UrlDecode<10>/url_decode_10pct/1000000/1000/manual_time        73.2 ms         73.3 ms            9 bytes_per_second=12.7679G/s
UrlDecode<10>/url_decode_10pct/100/10000/manual_time          0.191 ms        0.215 ms         3306 bytes_per_second=4.88966G/s
UrlDecode<10>/url_decode_10pct/1000/10000/manual_time         0.741 ms        0.766 ms          945 bytes_per_second=12.5748G/s
UrlDecode<10>/url_decode_10pct/10000/10000/manual_time         6.72 ms         6.75 ms          104 bytes_per_second=13.8707G/s
UrlDecode<10>/url_decode_10pct/100000/10000/manual_time        71.9 ms         72.0 ms            9 bytes_per_second=12.9505G/s
terminate called after throwing an instance of 'rmm::bad_alloc'
  what():  std::bad_alloc: RMM failure at:/ssd1/chenrui/cudf-origin/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/pool_memory_resource.hpp:188: Maximum pool size exceeded
Aborted
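
As a rough reading of the output above: the benchmark name appears to encode the number of rows and the characters per row (`.../<rows>/<chars>/manual_time`), and the reported `bytes_per_second` values are consistent with counting about 4 bytes of offset per row on top of the characters, in binary units. Both points are inferred from the numbers, not taken from the benchmark source. Under those assumptions, the sketch below reproduces one throughput figure and estimates the input size of the case that would come next in the sweep (1000000 rows of 10000 characters), which is presumably where the pool was exhausted:

```python
GIB = 1024 ** 3
OFFSET_BYTES = 4  # assumed per-row offset size; inferred, not confirmed

def column_bytes(rows, chars_per_row):
    """Approximate size of a strings column: characters plus per-row offsets."""
    return rows * (chars_per_row + OFFSET_BYTES)

def throughput_gib_per_s(rows, chars_per_row, time_ms):
    """Bytes processed per second, in GiB/s, for one benchmark row."""
    return column_bytes(rows, chars_per_row) / (time_ms / 1000.0) / GIB

# 100000 rows x 10 chars at 0.206 ms: close to the reported 6.32715G/s.
observed = throughput_gib_per_s(100_000, 10, 0.206)

# Input alone for 1000000 rows x 10000 chars, before any output or scratch space.
next_case_gib = column_bytes(1_000_000, 10_000) / GIB
```

That is roughly 9.3 GiB of input before the decoded output is allocated, so the `rmm::bad_alloc` most likely reflects the memory footprint of that benchmark case and not a fault in url_decode itself.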

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@jlowe
Member

jlowe commented May 25, 2021

Still relevant

@jrhemstad jrhemstad added 0 - Backlog In queue waiting for assignment and removed inactive-30d labels May 25, 2021
rapids-bot bot pushed a commit that referenced this issue Aug 30, 2021
This PR is intended to optimize the URL decoding performance, especially on large URLs. Additionally, a test case for large URLs has been added.

When tested on a V100, the baseline performance at 7521c3f is:
```
------------------------------------------------------------------------------------------------------------------
Benchmark                                                        Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------
UrlDecode<10>/url_decode_10pct/100000000/10/manual_time        111 ms          111 ms            6 bytes_per_second=11.7959G/s
UrlDecode<10>/url_decode_10pct/10000000/100/manual_time        107 ms          107 ms            7 bytes_per_second=9.0136G/s
UrlDecode<10>/url_decode_10pct/1000000/1000/manual_time        107 ms          107 ms            7 bytes_per_second=8.76755G/s
UrlDecode<50>/url_decode_50pct/100000000/10/manual_time        129 ms          129 ms            5 bytes_per_second=10.144G/s
UrlDecode<50>/url_decode_50pct/10000000/100/manual_time        126 ms          126 ms            6 bytes_per_second=7.70821G/s
UrlDecode<50>/url_decode_50pct/1000000/1000/manual_time        122 ms          122 ms            6 bytes_per_second=7.66783G/s
```

This PR
```
------------------------------------------------------------------------------------------------------------------
Benchmark                                                        Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------
UrlDecode<10>/url_decode_10pct/100000000/10/manual_time       97.5 ms         97.6 ms            7 bytes_per_second=13.3669G/s
UrlDecode<10>/url_decode_10pct/10000000/100/manual_time       28.8 ms         28.8 ms           24 bytes_per_second=33.6024G/s
UrlDecode<10>/url_decode_10pct/1000000/1000/manual_time       21.8 ms         21.8 ms           32 bytes_per_second=42.9686G/s
UrlDecode<50>/url_decode_50pct/100000000/10/manual_time        109 ms          109 ms            6 bytes_per_second=11.9786G/s
UrlDecode<50>/url_decode_50pct/10000000/100/manual_time       30.2 ms         30.3 ms           23 bytes_per_second=32.0311G/s
UrlDecode<50>/url_decode_50pct/1000000/1000/manual_time       22.7 ms         22.8 ms           31 bytes_per_second=41.1086G/s
```

close #8030

Authors:
  - https://github.com/gaohao95

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #8622
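
The per-configuration speedup implied by the two tables in the commit message can be computed directly from the reported times. A small sketch, with times in milliseconds copied from the tables; the dictionary layout is only for illustration:

```python
# (encoded percentage, rows, chars per row) -> (baseline ms, this PR ms)
times_ms = {
    (10, 100_000_000, 10): (111, 97.5),
    (10, 10_000_000, 100): (107, 28.8),
    (10, 1_000_000, 1000): (107, 21.8),
    (50, 100_000_000, 10): (129, 109),
    (50, 10_000_000, 100): (126, 30.2),
    (50, 1_000_000, 1000): (122, 22.7),
}

# Speedup = baseline time divided by the time measured with the PR applied.
speedups = {cfg: base / new for cfg, (base, new) in times_ms.items()}

for (pct, rows, chars), s in speedups.items():
    print(f"{pct}% encoded, {rows} rows x {chars} chars: {s:.2f}x")
```

The gain is largest for long strings (about 4.9x to 5.4x at 1000 characters per row) and small for 10-character strings (about 1.1x to 1.2x), which matches the PR's stated focus on large URLs.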