[BUG] Quadratic (in number of columns) behaviour in read_csv #14005

Closed
wence- opened this issue Aug 30, 2023 · 1 comment · Fixed by #14028
Labels: 2 - In Progress, bug, Performance

Comments


wence- commented Aug 30, 2023

Describe the bug

When calling cudf.read_csv on a CSV file with a very large number of columns (hundreds of thousands), it takes an unexpectedly long time. Yes, I don't expect this to be performant, but...

I create a sequence of CSV files with 1 row, and N columns:

%time cudf.read_csv("1000.csv", header=None);
CPU times: user 141 ms, sys: 3.67 ms, total: 145 ms
Wall time: 143 ms

%time cudf.read_csv("10000.csv", header=None);
CPU times: user 10.3 s, sys: 2.11 ms, total: 10.3 s
Wall time: 10.3 s

%time cudf.read_csv("20000.csv", header=None);
CPU times: user 41.8 s, sys: 48 ms, total: 41.9 s
Wall time: 41.6 s
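
The issue does not show how the input files were produced. A minimal sketch of one way to generate them follows; the helper name write_wide_csv and the integer cell values are assumptions for illustration, and only the column count matters for the timings.

```python
import csv
from pathlib import Path


def write_wide_csv(path, ncols, nrows=1):
    """Write a headerless CSV with `nrows` rows and `ncols` integer columns.

    Hypothetical helper: the original files may have held different values.
    """
    path = Path(path)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        for r in range(nrows):
            # Cell values are arbitrary; only the number of columns matters.
            writer.writerow(range(r * ncols, (r + 1) * ncols))
    return path
```

For example, write_wide_csv("1000.csv", 1000) would produce a file with the same shape as the first one timed above.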

The culprit is this bit of code:

        dtype = {} if dtype is None else dtype
        unspecified_dtypes = {
            name: df._dtypes[name]
            for name in df._column_names
            if name not in dtype
        }

This looks innocuous, but unfortunately, df._dtypes is a property:

@property
def _dtypes(self):
    return dict(
        zip(self._data.names, (col.dtype for col in self._data.columns))
    )

So each name-to-dtype lookup in the loop rebuilds the whole dict, costing O(ncolumns) rather than O(1), and the loop as a whole is O(ncolumns^2).
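
The effect can be reproduced without cudf. The sketch below uses a toy stand-in class (ToyFrame, its rebuilds counter, and both helper functions are made up for illustration, not cudf code) to count how often the mapping is rebuilt when the property is read inside the comprehension versus once outside it.

```python
class ToyFrame:
    """Stand-in for a frame whose `_dtypes` is rebuilt on every access."""

    def __init__(self, names, dtypes):
        self._names = list(names)
        self._col_dtypes = list(dtypes)
        self.rebuilds = 0  # how many times the mapping has been reconstructed

    @property
    def _dtypes(self):
        self.rebuilds += 1
        # Building the dict touches every column: O(ncolumns) per access.
        return dict(zip(self._names, self._col_dtypes))


def unspecified_slow(frame, dtype):
    # Property access inside the comprehension: one full rebuild per column.
    return {n: frame._dtypes[n] for n in frame._names if n not in dtype}


def unspecified_fast(frame, dtype):
    # Build the mapping once, then do O(1) lookups in the loop.
    dtypes = frame._dtypes
    return {n: dtypes[n] for n in frame._names if n not in dtype}
```

With N columns and no user-specified dtypes, unspecified_slow triggers N rebuilds of an N-entry dict, while unspecified_fast triggers one.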

After a localised fix:

diff --git a/python/cudf/cudf/io/csv.py b/python/cudf/cudf/io/csv.py
index 95e0aa1807..3a74e75556 100644
--- a/python/cudf/cudf/io/csv.py
+++ b/python/cudf/cudf/io/csv.py
@@ -123,11 +123,12 @@ def read_csv(
     if dtype is None or isinstance(dtype, abc.Mapping):
         # There exists some dtypes in the result columns that is inferred.
         # Find them and map them to the default dtypes.
-        dtype = {} if dtype is None else dtype
+        known_dtypes = set() if dtype is None else set(dtype)
+        df_dtypes = df._dtypes
         unspecified_dtypes = {
-            name: df._dtypes[name]
+            name: df_dtypes[name]
             for name in df._column_names
-            if name not in dtype
+            if name not in known_dtypes
         }
         default_dtypes = {}
 
In [3]: %time cudf.read_csv("1000.csv", header=None);
CPU times: user 49.1 ms, sys: 4.29 ms, total: 53.4 ms
Wall time: 51.8 ms

In [4]: %time cudf.read_csv("10000.csv", header=None);
CPU times: user 485 ms, sys: 12.1 ms, total: 497 ms
Wall time: 493 ms

In [5]: %time cudf.read_csv("20000.csv", header=None);
CPU times: user 897 ms, sys: 14.8 ms, total: 912 ms
Wall time: 906 ms

and all is right with the world.
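
As a rough sanity check on the scaling, the exponent k in t ~ N^k can be estimated from the two largest measurements on each side of the fix. This is only a two-point estimate from the wall times quoted in this issue, not a proper benchmark, and the function name is an assumption.

```python
import math


def scaling_exponent(n1, t1, n2, t2):
    """Estimate k in t ~ n**k from two (size, time) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)


# Wall times in seconds for 10000 and 20000 columns, as reported above.
before = scaling_exponent(10_000, 10.3, 20_000, 41.6)
after = scaling_exponent(10_000, 0.493, 20_000, 0.906)
print(f"before fix: ~N^{before:.2f}, after fix: ~N^{after:.2f}")
```

The estimate comes out close to 2 before the fix and a little under 1 after it, which matches quadratic versus roughly linear behaviour.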

wence- added the bug and Needs Triage labels on Aug 30, 2023

wence- commented Aug 30, 2023

Looking a bit more closely, the _dtypes property is accessed in three places that will produce the same behaviour: besides read_csv, in read_json and in the dtypes property of GroupBy objects.

Ideally one would avoid this particular footgun by making _dtypes a cached_property, but is that problematic if someone updates a dataframe "inplace"?
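
To make the concern concrete, here is a small sketch of the staleness problem using functools.cached_property on a toy class. CachedToyFrame and its methods are hypothetical and say nothing about how cudf's Frame mutates its columns; the point is only that every in-place update path would have to remember to drop the cached value.

```python
from functools import cached_property


class CachedToyFrame:
    """Hypothetical frame-like class; not cudf's Frame."""

    def __init__(self, columns):
        self._columns = dict(columns)  # column name -> dtype string

    @cached_property
    def _dtypes(self):
        # Computed on first access, then stored in the instance __dict__.
        return dict(self._columns)

    def add_column_unsafe(self, name, dtype):
        # In-place update that forgets about the cache: _dtypes goes stale.
        self._columns[name] = dtype

    def add_column(self, name, dtype):
        self._columns[name] = dtype
        # Drop the cached value so the next access recomputes it.
        self.__dict__.pop("_dtypes", None)


frame = CachedToyFrame({"a": "int64"})
frame._dtypes                            # populate the cache
frame.add_column_unsafe("b", "float64")
print("b" in frame._dtypes)              # False: the cache is stale
frame.add_column("c", "object")
print(sorted(frame._dtypes))             # ['a', 'b', 'c'] after invalidation
```

Hoisting the property access out of the loop, as in the diff above, avoids having to audit every mutation path for a missing invalidation.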

wence- self-assigned this on Sep 1, 2023
wence- added the 2 - In Progress and Performance labels and removed the Needs Triage label on Sep 1, 2023
wence- added a commit to wence-/cudf that referenced this issue on Sep 1, 2023
Frame._dtypes maps column names to dtypes, however, it is a property
that is computed on-demand. Consequently, a seemingly innocuous dict
lookup is actually O(N). When used in a loop over columns, this makes
an O(N) loop into an O(N^2) one.

This mostly bites on IO when reading data with many thousands of
columns. To fix this, manually move access of Frame._dtypes outside of
any loop over columns.

A more systematic way might be to make this a cached property, but the
cache invalidation is rather hard to reason about.

- Closes rapidsai#14005
rapids-bot bot pushed a commit that referenced this issue Sep 1, 2023
Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - https://github.com/brandon-b-miller

URL: #14028