Query csv with group by very slow #31
Labels: bug (Something isn't working), fixed (Issue is fixed and CI ready), test wanted (Feature requires testing and validation)
Comments
Where can we download organizations-2000000.csv?
This is a serious bug; any CSV data above 200MB can reproduce it with an aggregation query.
lmangani added the "test wanted" and "fixed" labels on May 11, 2023
@xbsura could you kindly retest and confirm the latest release fixes the reported issue? Thanks for your report!
Confirmed fixed, thanks.
Describe the situation
import chdb
res=chdb.query('select count(*) cnt from file("/Users/xbsura/Downloads/organizations-2000000.csv", CSVWithNames) group by Name order by cnt desc', 'CSV')
wc -l /Users/xbsura/Downloads/organizations-2000000.csv
2000001 /Users/xbsura/Downloads/organizations-2000000.csv
head /Users/xbsura/Downloads/organizations-2000000.csv
Index,Organization Id,Name,Website,Country,Description,Founded,Industry,Number of employees
1,391dAA77fea9EC1,Daniel-Mcmahon,https://stuart-rios.biz/,Cambodia,Focused eco-centric help-desk,2013,Sports,1878
2,9FcCA4A23e6BcfA,"Mcdowell, Tate and Murray",http://jacobs.biz/,Guyana,Front-line real-time portal,2018,Legal Services,9743
3,DB23330238B7B3D,"Roberts, Carson and Trujillo",http://www.park.com/,Jordan,Innovative hybrid data-warehouse,1992,Hospitality,7537
4,bbf18835CFbEee7,"Poole, Jefferson and Merritt",http://hayden.com/,Cocos (Keeling) Islands,Extended regional Graphic Interface,1991,Food Production,9974
This query needs more than 1 minute to finish and uses more than 100 GB of memory.
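For anyone who does not have the original file, here is a rough sketch that writes a CSV with the same header and a similar shape. The values are random placeholders and do not match the real dataset, and `write_synthetic_organizations` is a hypothetical helper name, so treat it only as a stand-in for reproducing the group-by behaviour.

```python
import csv
import random
import string

# Header copied from the `head` output in this issue.
HEADER = ["Index", "Organization Id", "Name", "Website", "Country",
          "Description", "Founded", "Industry", "Number of employees"]


def write_synthetic_organizations(path, rows, seed=0):
    """Write a header line plus `rows` data rows of random placeholder values."""
    rng = random.Random(seed)

    def word(n):
        # Random lowercase word of length n (placeholder text, not real data).
        return "".join(rng.choices(string.ascii_lowercase, k=n))

    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(HEADER)
        for i in range(1, rows + 1):
            # Names contain commas, so csv.writer quotes them like the sample rows.
            name = f"{word(6).title()}, {word(5).title()} and {word(7).title()}"
            writer.writerow([
                i,
                "".join(rng.choices(string.hexdigits, k=15)),
                name,
                f"http://{word(8)}.com/",
                word(7).title(),
                f"{word(8)} {word(6)} {word(9)}",
                rng.randint(1970, 2022),
                word(10).title(),
                rng.randint(1, 10000),
            ])
```

Calling `write_synthetic_organizations("organizations-2000000.csv", 2_000_000)` should give a file with 2000001 lines, the same line count as reported above, though the byte size will differ from the original.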
Which ClickHouse server version to use
res = chdb.query('select version()', 'Pretty'); print(res.data())
┏━━━━━━━━━━━┓
┃ version() ┃
┡━━━━━━━━━━━┩
│ 22.12.1.1 │
└───────────┘
Queries to run that lead to slow performance
select count(*) cnt from file("/Users/xbsura/Downloads/organizations-2000000.csv", CSVWithNames) group by Name order by cnt desc
Expected performance
For a 200MB file, finishing in under 1 second would be acceptable.
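To compare actual against expected timing, a small wall-clock wrapper along these lines could be used. `time_query` is a hypothetical helper, not part of chdb; the only chdb call it relies on is `chdb.query(sql, 'CSV')` as used earlier in this issue.

```python
import time


def time_query(sql, run=None):
    """Run `sql` once and return (elapsed_seconds, result)."""
    if run is None:
        # Imported lazily so the helper itself does not require chdb to load.
        import chdb
        run = lambda q: chdb.query(q, "CSV")
    start = time.perf_counter()
    result = run(sql)
    return time.perf_counter() - start, result


# Example call (path taken from this issue; adjust to your own file):
# elapsed, res = time_query(
#     'select count(*) cnt from file("/Users/xbsura/Downloads/organizations-2000000.csv", '
#     'CSVWithNames) group by Name order by cnt desc')
# print(f"{elapsed:.2f}s")
```

This measures wall-clock time only; memory usage would need a separate tool.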