Poor readcsv speed #16015

Closed
femtotrader opened this issue Apr 23, 2016 · 13 comments
@femtotrader

femtotrader commented Apr 23, 2016

Hello,

I am trying to read one month of AUD/USD tick data.

Sample data can be found here:
https://drive.google.com/file/d/0B8iUtWjZOTqla3ZZTC1FS0pkZXc/view?usp=sharing
See also pydata/pandas-datareader#153.

AUDUSD-2014-01.zip is an 11 MB file containing AUDUSD-2014-01.csv, an 85 MB file,
which is not that big!

With Python / Pandas

$ ipython

In [1]: import pandas as pd

In [2]: %time df=pd.read_csv("AUDUSD-2014-01.csv", names=['Symbol', 'Date', 'Bid', 'Ask'])
CPU times: user 3.22 s, sys: 510 ms, total: 3.73 s
Wall time: 4.02 s

With Julia / readcsv

julia> @time df=readcsv("AUDUSD-2014-01.csv");
 11.376916 seconds (31.17 M allocations: 1.012 GB, 33.02% gc time)

It's even worse with DataFrames.jl's readtable;
see JuliaData/DataFrames.jl#942.

Kind regards
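
For anyone who wants to reproduce this kind of measurement without downloading the original file, here is a minimal sketch using only the Python standard library. The file contents are synthetic, and the `Symbol,Date,Bid,Ask` layout and timestamp format are assumptions modelled on the column names above, not the real data. The helper names (`write_synthetic_ticks`, `read_ticks`, `time_reader`) are hypothetical. It times the stdlib `csv` reader, not pandas, so the absolute numbers are not comparable with the ones in this thread.

```python
import csv
import os
import tempfile
import timeit


def write_synthetic_ticks(path, n_rows):
    """Write n_rows of made-up tick rows (assumed Symbol,Date,Bid,Ask layout)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for i in range(n_rows):
            bid = 0.87 + (i % 100) * 1e-5
            writer.writerow([
                "AUD/USD",
                f"20140101 00:00:{i % 60:02d}.000",  # assumed timestamp format
                f"{bid:.5f}",
                f"{bid + 0.0001:.5f}",
            ])


def read_ticks(path):
    """Parse the file into a list of (symbol, date, bid, ask) tuples."""
    with open(path, newline="") as f:
        return [(sym, date, float(bid), float(ask))
                for sym, date, bid, ask in csv.reader(f)]


def time_reader(path, repeats=3):
    """Return the best wall-clock time in seconds over several single runs."""
    return min(timeit.repeat(lambda: read_ticks(path), number=1, repeat=repeats))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ticks.csv")
    write_synthetic_ticks(path, 10_000)
    rows = read_ticks(path)
    best = time_reader(path)
    print(f"{len(rows)} rows, best of 3: {best:.4f} s")
```

Taking the best of several runs reduces noise from disk caching and warm-up. The same precaution applies on the Julia side, where the first `@time` call of a function also includes compilation time.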

@nalimilan
Member

This is a well-known problem. See https://github.com/JuliaDB/CSV.jl for a faster alternative.

@femtotrader
Author

Thanks. I'm closing this, then.

@femtotrader
Author

julia> @time df=CSV.read("AUDUSD-2014-01.csv");
  0.168033 seconds (17.87 k allocations: 85.767 MB, 5.83% gc time)

That's much better... I wonder whether DataFrames.jl's readtable should use it?

@femtotrader
Author

Oh... I did it wrong. The result is just the raw bytes of the file, not a parsed table:

julia> df
89103612-element Array{UInt8,1}:
 0x41
 0x55
 0x44
 0x2f
 0x55
 0x53
 0x44
 0x2c
 0x32
 0x30
 0x31
 0x34
 0x30
 0x31
 0x30
 ⋮
 0x2e
 0x38
 0x37
 0x35
 0x33
 0x31
 0x2c
 0x30
 0x2e
 0x38
 0x37
 0x35
 0x37
 0x34
 0x0a
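
As a quick check of what that array holds, the bytes shown above can be decoded as ASCII. This is only a sketch in Python: the two byte lists are copied from the printed output, and the elided middle of the array is not reconstructed.

```python
# First and last bytes of the array printed above, copied verbatim.
head = bytes([0x41, 0x55, 0x44, 0x2f, 0x55, 0x53, 0x44, 0x2c,
              0x32, 0x30, 0x31, 0x34, 0x30, 0x31, 0x30])
tail = bytes([0x2e, 0x38, 0x37, 0x35, 0x33, 0x31, 0x2c, 0x30,
              0x2e, 0x38, 0x37, 0x35, 0x37, 0x34, 0x0a])

head_text = head.decode("ascii")
tail_text = tail.decode("ascii")

print(repr(head_text))  # 'AUD/USD,2014010'
print(repr(tail_text))  # '.87531,0.87574\n'
```

So the array starts with the text of the first CSV row and ends with a bid/ask pair followed by a newline. That fits the unparsed file content, and explains why the call returned so quickly.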

@nalimilan
Member

See this blog post about how to use it: http://julialang.org/blog/2015/10/datastreams

Anyway, I think the plan is to use it when it's ready, but for now the data management packages are quite in flux.

@femtotrader
Author

Thanks

@nalimilan
Member

> Anyway, I think the plan is to use it when it's ready, but for now the data management packages are quite in flux.

Correction: I was talking about DataFrames.jl there. I'm not sure what the plan is for Julia Base.

@nalimilan
Member

@ViralBShah Since you noted in the other issue that readdlm was supposed to be as fast as Pandas, should we reopen this issue in order to have a deeper look?

@ViralBShah
Member

I should clarify that I didn't test pandas on my machine and was using 5 sec as the benchmark, but this is what I see. Perhaps if we can make this more efficient wrt GC, this could be faster.

viral-laptop 06:53:30 ~/Downloads$ j
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.0-dev+3623 (2016-04-19 03:38 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit fb81034 (4 days old master)
|__/                   |  x86_64-apple-darwin15.4.0

julia> @time readdlm("AUDUSD-2014-01.csv");
  5.470630 seconds (17.09 M allocations: 649.867 MB, 7.54% gc time)

julia> @time readcsv("AUDUSD-2014-01.csv");
  4.077544 seconds (31.19 M allocations: 1.013 GB, 7.18% gc time)

@ViralBShah
Member

Pandas is almost twice as fast as reported:

In [2]: import pandas as pd

In [3]: %time df=pd.read_csv("AUDUSD-2014-01.csv", names=['Symbol', 'Date', 'Bid', 'Ask'])
CPU times: user 2.09 s, sys: 155 ms, total: 2.25 s
Wall time: 2.28 s

@quinnj
Member

quinnj commented Apr 23, 2016

I get:

>>> timeit.timeit("df=pandas.read_csv('/Users/jacobquinn/Downloads/AUDUSD-2014-01.csv', names=['Symbol', 'Date', 'Bid', 'Ask'])",setup='import pandas',number=1)
1.7587840557098389

julia> @time CSV.read("/Users/jacobquinn/Downloads/AUDUSD-2014-01.csv"; header=["Symbol","Date","Bid","Ask"]);
  1.809735 seconds (12.13 M allocations: 323.253 MB, 9.30% gc time)

(note this is CSV master)
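
To put the numbers posted so far side by side, here is a small Python sketch that computes each reader's slowdown relative to pandas on the same machine. The timings are copied from the comments above. The file size is taken from the length of the byte array printed earlier, assuming that array is the whole file. The function names are only illustrative.

```python
# Wall-clock timings in seconds, copied from the comments in this thread.
timings = {
    "femtotrader": {"pandas.read_csv": 4.02, "readcsv": 11.376916},
    "ViralBShah": {"pandas.read_csv": 2.28, "readcsv": 4.077544,
                   "readdlm": 5.470630},
    "quinnj": {"pandas.read_csv": 1.7587840557098389,
               "CSV.read (master)": 1.809735},
}

# Assumption: the 89103612-element byte array shown earlier is the whole file.
FILE_BYTES = 89103612


def slowdown_vs_pandas(machine):
    """Ratio of each reader's time to the pandas time on the same machine."""
    base = machine["pandas.read_csv"]
    return {name: round(t / base, 2)
            for name, t in machine.items() if name != "pandas.read_csv"}


def throughput_mb_s(seconds):
    """Parsing throughput in MB/s (1 MB = 10**6 bytes)."""
    return FILE_BYTES / seconds / 1e6


ratios = {who: slowdown_vs_pandas(m) for who, m in timings.items()}
for who, r in ratios.items():
    print(who, r)
# femtotrader {'readcsv': 2.83}
# ViralBShah {'readcsv': 1.79, 'readdlm': 2.4}
# quinnj {'CSV.read (master)': 1.03}
```

On these figures, readcsv is roughly 1.8x to 2.8x slower than pandas depending on the machine, while CSV.read on master is within a few percent of it. The measurements come from three different machines, so only the per-machine ratios are meaningful.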

@jiahao
Member

jiahao commented Apr 24, 2016

Duplicate of #10428

@ViralBShah
Member

Should we be asking people to start using CSV?
