[BUG] Quadratic (in number of columns) behaviour in `read_csv` #14005

wence- · 2023-08-30T15:07:33Z

Describe the bug

When calling cudf.read_csv on a CSV file with many (hundreds of thousands of) columns, we take an unexpectedly long time. Yes, I don't expect this to be performant, but...

I create a sequence of CSV files with 1 row, and N columns:

%time cudf.read_csv("1000.csv", header=None);
CPU times: user 141 ms, sys: 3.67 ms, total: 145 ms
Wall time: 143 ms

%time cudf.read_csv("10000.csv", header=None);
CPU times: user 10.3 s, sys: 2.11 ms, total: 10.3 s
Wall time: 10.3 s

%time cudf.read_csv("20000.csv", header=None);
CPU times: user 41.8 s, sys: 48 ms, total: 41.9 s
Wall time: 41.6 s
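For reproduction, files like these can be generated with a few lines of Python. This is a minimal sketch, not the script used for the timings above: the helper name `write_wide_csv` and the integer cell values are assumptions; only the `<N>.csv` naming and the one-row, N-column shape come from the report.

```python
import csv
from pathlib import Path


def write_wide_csv(directory, ncols):
    """Write `<ncols>.csv` containing one row of `ncols` integer fields.

    Hypothetical helper: the cell contents are placeholders.
    """
    path = Path(directory) / f"{ncols}.csv"
    with path.open("w", newline="") as f:
        # A single row, no header, matching header=None in the report.
        csv.writer(f).writerow(range(ncols))
    return path
```

Calling `write_wide_csv(".", n)` for n in 1000, 10000 and 20000 produces files named as in the timings above.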

The culprit is this bit of code:

        dtype = {} if dtype is None else dtype
        unspecified_dtypes = {
            name: df._dtypes[name]
            for name in df._column_names
            if name not in dtype
        }

This looks innocuous, but unfortunately, df._dtypes is a property:

@property
def _dtypes(self):
    return dict(
        zip(self._data.names, (col.dtype for col in self._data.columns))
    )

So each name-to-dtype lookup in the loop is O(ncolumns) rather than O(1), which makes the whole loop O(ncolumns^2).
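The effect can be reproduced without cudf. The sketch below uses a hypothetical `ToyFrame` class (not cudf's `Frame`) whose `_dtypes` property rebuilds a dict on every access, and counts the rebuilds. Accessing the property inside the loop rebuilds the dict once per looked-up column, whereas binding it to a local first rebuilds it once.

```python
class ToyFrame:
    """Hypothetical stand-in for a frame; not cudf's implementation."""

    def __init__(self, dtypes_by_name):
        self._data = dict(dtypes_by_name)
        self.rebuilds = 0  # how many times _dtypes built a fresh dict

    @property
    def _column_names(self):
        return tuple(self._data)

    @property
    def _dtypes(self):
        # Every access costs O(ncolumns): a new dict is built each time.
        self.rebuilds += 1
        return dict(self._data)


def unspecified_slow(df, dtype):
    # Property access inside the loop: one O(N) rebuild per column.
    return {n: df._dtypes[n] for n in df._column_names if n not in dtype}


def unspecified_fast(df, dtype):
    # Hoist the property access: a single O(N) rebuild in total.
    df_dtypes = df._dtypes
    return {n: df_dtypes[n] for n in df._column_names if n not in dtype}
```

Both functions return the same mapping; only the number of rebuilds differs.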

After a localised fix:

diff --git a/python/cudf/cudf/io/csv.py b/python/cudf/cudf/io/csv.py
index 95e0aa1807..3a74e75556 100644
--- a/python/cudf/cudf/io/csv.py
+++ b/python/cudf/cudf/io/csv.py
@@ -123,11 +123,12 @@ def read_csv(
     if dtype is None or isinstance(dtype, abc.Mapping):
         # There exists some dtypes in the result columns that is inferred.
         # Find them and map them to the default dtypes.
-        dtype = {} if dtype is None else dtype
+        known_dtypes = set() if dtype is None else set(dtype)
+        df_dtypes = df._dtypes
         unspecified_dtypes = {
-            name: df._dtypes[name]
+            name: df_dtypes[name]
             for name in df._column_names
-            if name not in dtype
+            if name not in known_dtypes
         }
         default_dtypes = {}

In [3]: %time cudf.read_csv("1000.csv", header=None);
CPU times: user 49.1 ms, sys: 4.29 ms, total: 53.4 ms
Wall time: 51.8 ms

In [4]: %time cudf.read_csv("10000.csv", header=None);
CPU times: user 485 ms, sys: 12.1 ms, total: 497 ms
Wall time: 493 ms

In [5]: %time cudf.read_csv("20000.csv", header=None);
CPU times: user 897 ms, sys: 14.8 ms, total: 912 ms
Wall time: 906 ms

and all is right with the world.
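As a sanity check on the "quadratic" claim, the scaling exponent can be estimated from the wall times above as log(t2/t1) / log(n2/n1). This is just arithmetic on the reported numbers; the helper name `scaling_exponent` is made up for illustration.

```python
import math


def scaling_exponent(n1, t1, n2, t2):
    """Estimate k in t ~ n**k from two (size, time) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)


# Wall times in seconds, copied from the timings in this report.
before = scaling_exponent(10_000, 10.3, 20_000, 41.6)
after = scaling_exponent(10_000, 0.493, 20_000, 0.906)
print(f"before fix: k = {before:.2f}")
print(f"after fix:  k = {after:.2f}")
```

An exponent close to 2 before the fix and a little under 1 after it is consistent with the loop going from O(N^2) to O(N).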


wence- · 2023-08-30T15:23:59Z

Looking a bit more closely, the _dtypes property is accessed in three places that will produce the same behaviour, including read_json and the dtypes property of GroupBy objects.

Ideally one would avoid this particular footgun by making _dtypes a cached_property, but is that problematic if someone updates a dataframe "inplace"?
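To make the in-place concern concrete, here is a toy sketch using `functools.cached_property`. `CachedFrame` and `add_column` are hypothetical names, not cudf API; the point is only that a cached mapping goes stale after a mutation unless every mutating path remembers to invalidate it.

```python
from functools import cached_property


class CachedFrame:
    """Hypothetical toy frame; not cudf's implementation."""

    def __init__(self, dtypes_by_name):
        self._data = dict(dtypes_by_name)

    @cached_property
    def _dtypes(self):
        # Computed on first access, then stored on the instance.
        return dict(self._data)

    def add_column(self, name, dtype, invalidate=False):
        # In-place update of the underlying column data.
        self._data[name] = dtype
        if invalidate:
            # Dropping the cached value forces a recompute on next access.
            self.__dict__.pop("_dtypes", None)


df = CachedFrame({"a": "int64"})
df._dtypes                      # populate the cache
df.add_column("b", "float64")   # mutate without invalidating
stale = "b" not in df._dtypes   # True: the cache missed the update
df.add_column("c", "object", invalidate=True)
fresh = "c" in df._dtypes       # True: the cache was rebuilt
```

Lookups are O(1) after the first access, but correctness now depends on every in-place operation invalidating the cache.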

Frame._dtypes maps column names to dtypes; however, it is a property that is computed on demand. Consequently, a seemingly innocuous dict lookup is actually O(N). When used in a loop over columns, this turns an O(N) loop into an O(N^2) one. This mostly bites on IO when reading data with many thousands of columns. To fix this, manually move access of Frame._dtypes outside of any loop over columns. A more systematic fix might be to make this a cached property, but the cache invalidation is rather hard to reason about.

Closes #14005

Authors: Lawrence Mitchell (https://github.com/wence-)
Approvers: https://github.com/brandon-b-miller
URL: #14028

wence- added the bug and Needs Triage labels on Aug 30, 2023

wence- self-assigned this on Sep 1, 2023

wence- added the 2 - In Progress and Performance labels and removed the Needs Triage label on Sep 1, 2023

wence- mentioned this issue on Sep 1, 2023 in "Remove quadratic runtime due to accessing Frame._dtypes in loop" #14028 (merged)

rapids-bot bot closed this as completed in #14028 on Sep 1, 2023
