DataFrames with many columns are too slow (because of show()) #2739

Closed
sl-solution opened this issue Apr 28, 2021 · 9 comments · Fixed by #2750
Labels
display ecosystem Issues in DataFrames.jl ecosystem


@sl-solution

I came across this issue with the following example:

df = DataFrame(rand(10,10^5),:auto);

@time show(df)
11.872621 seconds (806.51 k allocations: 55.695 MiB)

bkamins commented Apr 28, 2021

@ronisbr I thought we had this issue resolved. Since we crop the output, we do not need to process all columns of the table, only as many as are needed up to the cropping point.


bkamins commented Apr 28, 2021

@ronisbr - should we transfer it to PrettyTables.jl?

@bkamins bkamins added the ecosystem Issues in DataFrames.jl ecosystem label Apr 28, 2021
ronisbr added a commit to ronisbr/PrettyTables.jl that referenced this issue Apr 28, 2021

ronisbr commented Apr 28, 2021

This was very interesting! Indeed, we only process the columns that are printed. However, the code that handles the alignment regexes was based on a dictionary, and the keys of a dictionary are not sorted. Hence, we had something like:

1. For each key
    1. If this key refers to a printed column, continue.
    2. ...

Merely checking whether each key refers to a printed column, for a dictionary with 10^5 keys, was what took so long. I now sort the keys first and break out of the loop early. Can you please test against PrettyTables master?
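The difference between the two loops can be illustrated with a minimal sketch. This is not the PrettyTables.jl code: the function names, the `settings` dictionary (column index => regex), and the assumption that only the first `num_printed` columns are shown are all hypothetical, chosen only to show why an unsorted scan has to visit every key while a sorted scan can stop early.

```julia
# Hypothetical stand-in for a per-column settings dictionary (column index => regex).
# Unsorted scan: every key must be inspected, because keys arrive in arbitrary order.
function visit_unsorted(settings::Dict{Int,Regex}, num_printed::Int)
    visited = 0
    processed = Int[]
    for (col, _) in settings
        visited += 1
        col > num_printed && continue   # column is cropped, skip it
        push!(processed, col)
    end
    return sort!(processed), visited
end

# Sorted scan: once a key exceeds the last printed column, all later keys do too,
# so the loop can stop.
function visit_sorted(settings::Dict{Int,Regex}, num_printed::Int)
    visited = 0
    processed = Int[]
    for col in sort!(collect(keys(settings)))
        visited += 1
        col > num_printed && break      # every remaining key is cropped as well
        push!(processed, col)
    end
    return processed, visited
end

settings = Dict(i => r"\." for i in 1:10^5)
cols_unsorted, visited_unsorted = visit_unsorted(settings, 21)
cols_sorted, visited_sorted = visit_sorted(settings, 21)
println("unsorted scan visited ", visited_unsorted, " keys")
println("sorted scan visited ", visited_sorted, " keys")
```

Sorting has its own cost, so the gain depends on how expensive the per-key check is compared with the sort. The sketch only counts visited keys and does not measure time.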


bkamins commented Apr 28, 2021

It now works fast, so I am closing it here (@sl-solution - please reopen if it is not resolved on your side):

julia> using DataFrames

julia> df = DataFrame(rand(10,10^5),:auto);

julia> @time show(df)
10×100000 DataFrame
 Row │ x1        x2         x3        x4        x5        x6         x7          x8         x9        x10       x11        x12        x13        x14       x15       x16       x17        x18       x19        x20        x21          ⋯
     │ Float64   Float64    Float64   Float64   Float64   Float64    Float64     Float64    Float64   Float64   Float64    Float64    Float64    Float64   Float64   Float64   Float64    Float64   Float64    Float64    Float64      ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0.770095  0.930763   0.930172  0.288411  0.73186   0.422243   0.0726812   0.885706   0.412241  0.128459  0.938865   0.0904182  0.415545   0.510858  0.211872  0.364183  0.556349   0.474297  0.432206   0.889033   0.540821     ⋯
   2 │ 0.487093  0.0946448  0.31271   0.559378  0.990205  0.520601   0.987444    0.867994   0.731493  0.297628  0.261129   0.139024   0.536161   0.105239  0.82569   0.20426   0.078      0.842896  0.0656049  0.131637   0.913501
   3 │ 0.748085  0.26732    0.633036  0.366295  0.534449  0.387622   0.284518    0.398834   0.894456  0.364664  0.167245   0.234235   0.68858    0.825089  0.528478  0.442617  0.879186   0.50022   0.82288    0.640833   0.20789
   4 │ 0.778268  0.0363509  0.176529  0.464521  0.724974  0.696587   0.58628     0.838512   0.860486  0.660311  0.886006   0.833819   0.813987   0.965498  0.60225   0.955847  0.602312   0.927516  0.346862   0.632372   0.0326748
   5 │ 0.137549  0.863015   0.51994   0.152865  0.796528  0.677929   0.613353    0.144133   0.754116  0.233952  0.491231   0.749252   0.0228522  0.115496  0.98049   0.740114  0.259278   0.583515  0.896593   0.144487   0.544719     ⋯
   6 │ 0.142249  0.941106   0.03959   0.498538  0.953158  0.0127805  0.692526    0.0221065  0.122535  0.449789  0.838054   0.0308284  0.067964   0.335048  0.562823  0.394539  0.537513   0.319119  0.361563   0.0639871  0.0225556
   7 │ 0.615428  0.466185   0.750315  0.399418  0.856844  0.438293   0.00152589  0.631117   0.74643   0.225799  0.219621   0.239153   0.177381   0.922712  0.11122   0.880613  0.735859   0.822848  0.633511   0.425227   0.52695
   8 │ 0.290283  0.810359   0.665109  0.172438  0.606353  0.695701   0.410237    0.389044   0.420115  0.429968  0.880013   0.455375   0.142513   0.194378  0.49893   0.506015  0.0192198  0.309886  0.352549   0.582254   0.496949
   9 │ 0.821604  0.214029   0.991164  0.905308  0.614148  0.36473    0.520146    0.229557   0.967551  0.468765  0.938004   0.725434   0.659296   0.530171  0.134858  0.407779  0.134668   0.787943  0.439675   0.734474   0.00725157   ⋯
  10 │ 0.350686  0.509928   0.7446    0.76572   0.431624  0.0582088  0.53054     0.411554   0.631528  0.338523  0.0554198  0.709037   0.290545   0.387795  0.599815  0.906752  0.469337   0.121768  0.697765   0.376085   0.0405376
                                                                                                                                                                                                                   99979 columns omitted
  2.115613 seconds (5.42 M allocations: 325.938 MiB, 7.06% gc time, 90.53% compilation time)

julia> @time show(df)
10×100000 DataFrame
 Row │ x1        x2         x3        x4        x5        x6         x7          x8         x9        x10       x11        x12        x13        x14       x15       x16       x17        x18       x19        x20        x21          ⋯
     │ Float64   Float64    Float64   Float64   Float64   Float64    Float64     Float64    Float64   Float64   Float64    Float64    Float64    Float64   Float64   Float64   Float64    Float64   Float64    Float64    Float64      ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0.770095  0.930763   0.930172  0.288411  0.73186   0.422243   0.0726812   0.885706   0.412241  0.128459  0.938865   0.0904182  0.415545   0.510858  0.211872  0.364183  0.556349   0.474297  0.432206   0.889033   0.540821     ⋯
   2 │ 0.487093  0.0946448  0.31271   0.559378  0.990205  0.520601   0.987444    0.867994   0.731493  0.297628  0.261129   0.139024   0.536161   0.105239  0.82569   0.20426   0.078      0.842896  0.0656049  0.131637   0.913501
   3 │ 0.748085  0.26732    0.633036  0.366295  0.534449  0.387622   0.284518    0.398834   0.894456  0.364664  0.167245   0.234235   0.68858    0.825089  0.528478  0.442617  0.879186   0.50022   0.82288    0.640833   0.20789
   4 │ 0.778268  0.0363509  0.176529  0.464521  0.724974  0.696587   0.58628     0.838512   0.860486  0.660311  0.886006   0.833819   0.813987   0.965498  0.60225   0.955847  0.602312   0.927516  0.346862   0.632372   0.0326748
   5 │ 0.137549  0.863015   0.51994   0.152865  0.796528  0.677929   0.613353    0.144133   0.754116  0.233952  0.491231   0.749252   0.0228522  0.115496  0.98049   0.740114  0.259278   0.583515  0.896593   0.144487   0.544719     ⋯
   6 │ 0.142249  0.941106   0.03959   0.498538  0.953158  0.0127805  0.692526    0.0221065  0.122535  0.449789  0.838054   0.0308284  0.067964   0.335048  0.562823  0.394539  0.537513   0.319119  0.361563   0.0639871  0.0225556
   7 │ 0.615428  0.466185   0.750315  0.399418  0.856844  0.438293   0.00152589  0.631117   0.74643   0.225799  0.219621   0.239153   0.177381   0.922712  0.11122   0.880613  0.735859   0.822848  0.633511   0.425227   0.52695
   8 │ 0.290283  0.810359   0.665109  0.172438  0.606353  0.695701   0.410237    0.389044   0.420115  0.429968  0.880013   0.455375   0.142513   0.194378  0.49893   0.506015  0.0192198  0.309886  0.352549   0.582254   0.496949
   9 │ 0.821604  0.214029   0.991164  0.905308  0.614148  0.36473    0.520146    0.229557   0.967551  0.468765  0.938004   0.725434   0.659296   0.530171  0.134858  0.407779  0.134668   0.787943  0.439675   0.734474   0.00725157   ⋯
  10 │ 0.350686  0.509928   0.7446    0.76572   0.431624  0.0582088  0.53054     0.411554   0.631528  0.338523  0.0554198  0.709037   0.290545   0.387795  0.599815  0.906752  0.469337   0.121768  0.697765   0.376085   0.0405376
                                                                                                                                                                                                                   99979 columns omitted
  0.193923 seconds (812.52 k allocations: 56.748 MiB, 6.38% gc time)

@bkamins bkamins closed this as completed Apr 28, 2021

ronisbr commented Apr 28, 2021

I will tag a new version now! Thanks!

@sl-solution
Author

The problem does not seem to be resolved yet:

df = DataFrame(rand(100,10^5),:auto);
show(df) # ok
allowmissing!(df)
show(df) # not fixed

@bkamins bkamins reopened this May 5, 2021

bkamins commented May 5, 2021

Indeed - I can reproduce it, so re-opening the issue.


ronisbr commented May 5, 2021

Hi @bkamins !

It turns out that the problem now is not inside PrettyTables.jl, but with the call:

types_str = compacttype.(eltype.(eachcol(df)), maxwidth)

It is taking too long to process the type names of all the columns.

You can see this by executing:

julia> df = DataFrame(rand(100,10^5),:auto);
julia> allowmissing!(df);
julia> DataFrames.compacttype.(eltype.(eachcol(df)), 9)

I am not sure how we can solve this, because PrettyTables.jl needs to receive the header of the entire table. Maybe we can preallocate a vector and fill only the entries that we are 100% sure will be printed. Ideas?
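The preallocation idea could look roughly like the sketch below. It uses only Base Julia. `partial_type_labels` and `max_printed` are hypothetical names, and `string` stands in for the real type-formatting call. The assumption is that an upper bound on the number of printed columns is known before the header is built.

```julia
# Build a full-length header, but compute labels only for the columns that
# can possibly be printed; the remaining entries stay as empty strings.
function partial_type_labels(types, max_printed::Int)
    labels = fill("", length(types))            # preallocated, one entry per column
    for i in 1:min(max_printed, length(types))
        labels[i] = string(types[i])            # stand-in for the expensive formatting
    end
    return labels
end

types = fill(Union{Missing,Float64}, 10^5)
labels = partial_type_labels(types, 21)
println(count(!isempty, labels), " labels computed out of ", length(labels))
```

The header keeps its full length, so code that expects one entry per column still works, while the cost grows with the number of printed columns rather than with the table width.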


bkamins commented May 5, 2021

I will fix it by memoization.
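A minimal sketch of what memoization could mean here, assuming the cost comes from formatting the same element type once per column. This is not the actual DataFrames.jl fix: `slow_type_label`, `memoized_type_labels`, and the call counter are hypothetical, and `string` stands in for `compacttype`.

```julia
# Hypothetical stand-in for an expensive type-to-string function;
# the counter records how many times it actually runs.
function slow_type_label(T::Type, counter::Ref{Int})
    counter[] += 1
    return string(T)
end

# Cache the label per distinct type, so each type is formatted only once
# no matter how many columns share it.
function memoized_type_labels(types, counter::Ref{Int})
    cache = Dict{Any,String}()
    return [get!(() -> slow_type_label(T, counter), cache, T) for T in types]
end

counter = Ref(0)
types = vcat(fill(Union{Missing,Float64}, 10^5), [Int])
labels = memoized_type_labels(types, counter)
println(length(labels), " labels built with ", counter[], " formatting calls")
```

With 10^5 columns that share one element type plus a single `Int` column, the formatting function runs twice instead of 100001 times.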
