Query csv with group by very slow #31

Closed
xbsura opened this issue May 10, 2023 · 6 comments
Assignees
Labels
bug (Something isn't working), fixed (Issue is fixed and CI ready), test wanted (Feature requires testing and validation)

Comments

@xbsura

xbsura commented May 10, 2023

Describe the situation
import chdb
res=chdb.query('select count(*) cnt from file("/Users/xbsura/Downloads/organizations-2000000.csv", CSVWithNames) group by Name order by cnt desc', 'CSV')

wc -l /Users/xbsura/Downloads/organizations-2000000.csv
2000001 /Users/xbsura/Downloads/organizations-2000000.csv

head /Users/xbsura/Downloads/organizations-2000000.csv
Index,Organization Id,Name,Website,Country,Description,Founded,Industry,Number of employees
1,391dAA77fea9EC1,Daniel-Mcmahon,https://stuart-rios.biz/,Cambodia,Focused eco-centric help-desk,2013,Sports,1878
2,9FcCA4A23e6BcfA,"Mcdowell, Tate and Murray",http://jacobs.biz/,Guyana,Front-line real-time portal,2018,Legal Services,9743
3,DB23330238B7B3D,"Roberts, Carson and Trujillo",http://www.park.com/,Jordan,Innovative hybrid data-warehouse,1992,Hospitality,7537
4,bbf18835CFbEee7,"Poole, Jefferson and Merritt",http://hayden.com/,Cocos (Keeling) Islands,Extended regional Graphic Interface,1991,Food Production,9974

This SQL query takes more than 1 minute to finish and uses more than 100 GB of memory.
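For anyone who wants to put numbers on the runtime and memory figures reported above, here is a minimal sketch of a measurement helper. The name `measure` is hypothetical and not part of chdb; it uses only the Python standard library and assumes a Unix-like system, since the `resource` module is unavailable on Windows.

```python
import resource
import sys
import time


def measure(run_query):
    """Run a zero-argument callable once; return (result, seconds, peak_rss_mib)."""
    start = time.perf_counter()
    result = run_query()
    elapsed = time.perf_counter() - start
    # ru_maxrss is the peak resident set size of the whole process so far,
    # reported in bytes on macOS and in kilobytes on Linux.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return result, elapsed, peak / divisor
```

To time the query from this report, something like `measure(lambda: chdb.query(sql, 'CSV'))` should work, with `sql` holding the query string. Because the peak covers the process lifetime, it is best run in a fresh interpreter.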

  • Which ClickHouse server version to use
    res = chdb.query('select version()', 'Pretty'); print(res.data())
    ┏━━━━━━━━━━━┓
    ┃ version() ┃
    ┡━━━━━━━━━━━┩
    │ 22.12.1.1 │
    └───────────┘

  • Queries to run that lead to slow performance
    select count(*) cnt from file("/Users/xbsura/Downloads/organizations-2000000.csv", CSVWithNames) group by Name order by cnt desc

Expected performance
For a 200 MB file, less than 1 second would be reasonable.
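As a sanity check on the query result (not on chdb's speed), the same aggregation can be reproduced in plain Python. This is only a rough baseline sketch: `group_counts` is a hypothetical helper, it assumes the file is UTF-8 with a header row containing a `Name` column as in the sample above, and it will be slower than a columnar engine.

```python
import csv
from collections import Counter


def group_counts(path, column="Name"):
    """Count rows per distinct value of `column`, largest count first."""
    with open(path, newline="", encoding="utf-8") as f:
        # DictReader handles quoted fields such as "Mcdowell, Tate and Murray".
        counts = Counter(row[column] for row in csv.DictReader(f))
    # Mirrors `select count(*) cnt ... group by Name order by cnt desc`:
    # only the counts are returned, in descending order.
    return sorted(counts.values(), reverse=True)
```

The returned list can be compared with the `cnt` column that chdb prints for the same file.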

@auxten
Member

auxten commented May 10, 2023

Where can we download organizations-2000000.csv?

@xbsura
Author

xbsura commented May 10, 2023

https://media.githubusercontent.com/media/datablist/sample-csv-files/main/files/organizations/organizations-2000000.zip

You can download it from the link above.

@auxten auxten added the bug Something isn't working label May 10, 2023
@auxten auxten self-assigned this May 10, 2023
@auxten
Member

auxten commented May 10, 2023

This is a serious bug; any CSV data above 200 MB with aggregation can reproduce it.
I will dig into this soon.

auxten added a commit that referenced this issue May 11, 2023
auxten added a commit that referenced this issue May 11, 2023
@lmangani
Contributor

lmangani commented May 11, 2023

The test seems to pass with chdb 0.8.0 and libchdb 0.8.0, which include #32 by @auxten.

@lmangani lmangani added test wanted Feature requires testing and validation fixed Issue is fixed and CI ready labels May 11, 2023
@lmangani
Contributor

@xbsura could you kindly retest and confirm the latest release fixes the reported issue? Thanks for your report!

@xbsura
Author

xbsura commented Jun 5, 2023

> @xbsura could you kindly retest and confirm the latest release fixes the reported issue? Thanks for your report!

Confirmed fixed, thanks.

@xbsura xbsura closed this as completed Jun 5, 2023
auxten added a commit that referenced this issue Jun 27, 2023
auxten added a commit that referenced this issue Jun 27, 2023
auxten added a commit that referenced this issue Jun 28, 2023
auxten added a commit that referenced this issue Jun 28, 2023
auxten added a commit that referenced this issue Aug 15, 2023
auxten added a commit that referenced this issue Aug 15, 2023
auxten added a commit that referenced this issue Nov 9, 2023
auxten added a commit that referenced this issue Nov 9, 2023
auxten added a commit that referenced this issue Dec 8, 2023
auxten added a commit that referenced this issue Dec 8, 2023
auxten added a commit that referenced this issue Jun 7, 2024
auxten added a commit that referenced this issue Jun 7, 2024

3 participants