diff --git a/docs/cudf/source/user_guide/pandas-comparison.md b/docs/cudf/source/user_guide/pandas-comparison.md
new file mode 100644
index 00000000000..e0e4dc0157e
--- /dev/null
+++ b/docs/cudf/source/user_guide/pandas-comparison.md
@@ -0,0 +1,155 @@
+# Comparison of cuDF and Pandas
+
+cuDF is a DataFrame library that closely matches the Pandas API, but
+leverages NVIDIA GPUs to speed up computations. However,
+there are some differences between cuDF and Pandas, both in terms of API
+and behavior. This page documents the similarities and differences
+between cuDF and Pandas.
+
+## Supported operations
+
+cuDF supports many of the same data structures and operations as
+Pandas. This includes `Series`, `DataFrame`, `Index` and
+operations on them such as unary and binary operations, indexing,
+filtering, concatenating, joining, groupby and window operations -
+among many others.
+
+The best way to see if we support a particular Pandas API is to search
+our [API docs](/api_docs/index).
+
+## Data types
+
+cuDF supports many common data types supported by Pandas, including
+numeric, datetime, timestamp, string, and categorical data types. In
+addition, we support special data types for decimal, list and "struct"
+values. See the section on [Data Types](data-types) for
+details.
+
+Note that we do not support custom data types like Pandas'
+`ExtensionDtype`.
+
+## Null (or "missing") values
+
+Unlike Pandas, *all* data types in cuDF are nullable,
+meaning they can contain missing values (represented by `cudf.NA`).
+
+```{code} python
+>>> s = cudf.Series([1, 2, cudf.NA])
+>>> s
+>>> s
+0       1
+1       2
+2    <NA>
+dtype: int64
+```
+
+Nulls are not coerced to `nan` in any situation;
+compare the behavior of cuDF with Pandas below:
+
+```{code} python
+>>> s = cudf.Series([1, 2, cudf.NA], dtype="category")
+>>> s
+0       1
+1       2
+2    <NA>
+dtype: category
+Categories (2, int64): [1, 2]
+
+>>> s = pd.Series([1, 2, pd.NA], dtype="category")
+>>> s
+0      1
+1      2
+2    NaN
+dtype: category
+Categories (2, int64): [1, 2]
+```
+
+See the docs on [missing data](Working-with-missing-data) for
+details.
+
+## Iteration
+
+Iterating over a cuDF `Series`, `DataFrame` or `Index` is not
+supported. This is because iterating over data that resides on the GPU
+will yield *extremely* poor performance, as GPUs are optimized for
+highly parallel operations rather than sequential operations.
+
+In the vast majority of cases, it is possible to avoid iteration and
+use an existing function or method to accomplish the same task. If you
+absolutely must iterate, copy the data from GPU to CPU by using
+`.to_arrow()` or `.to_pandas()`, then copy the result back to GPU
+using `.from_arrow()` or `.from_pandas()`.
+
+## Result ordering
+
+By default, `join` (or `merge`) and `groupby` operations in cuDF
+do *not* guarantee output ordering.
+Compare the results obtained from Pandas and cuDF below:
+
+```{code} python
+ >>> import cupy as cp
+ >>> df = cudf.DataFrame({'a': cp.random.randint(0, 1000, 1000), 'b': range(1000)})
+ >>> df.groupby("a").mean().head()
+          b
+ a
+ 742  694.5
+ 29   840.0
+ 459  525.5
+ 442  363.0
+ 666    7.0
+ >>> df.to_pandas().groupby("a").mean().head()
+          b
+ a
+ 2   643.75
+ 6    48.00
+ 7   631.00
+ 9   906.00
+ 10  640.00
+```
+
+To match Pandas behavior, you must explicitly pass `sort=True`:
+
+```{code} python
+>>> df.groupby("a", sort=True).mean().head()
+         b
+a
+2   643.75
+6    48.00
+7   631.00
+9   906.00
+10  640.00
+```
+
+## Column names
+
+Unlike Pandas, cuDF does not support duplicate column names.
+It is best to use strings for column names.
+
+## No true `"object"` data type
+
+In Pandas and NumPy, the `"object"` data type is used for
+collections of arbitrary Python objects. For example, in Pandas you
+can do the following:
+
+```{code} python
+>>> import pandas as pd
+>>> s = pd.Series(["a", 1, [1, 2, 3]])
+0            a
+1            1
+2    [1, 2, 3]
+dtype: object
+```
+
+For compatibility with Pandas, cuDF reports the data type for strings
+as `"object"`, but we do *not* support storing or operating on
+collections of arbitrary Python objects.
+
+## `.apply()` function limitations
+
+The `.apply()` function in Pandas accepts a user-defined function
+(UDF) that can include arbitrary operations that are applied to each
+value of a `Series`, `DataFrame`, or in the case of a groupby,
+each group. cuDF also supports `apply()`, but it relies on Numba to
+JIT compile the UDF and execute it on the GPU. This can be extremely
+fast, but imposes a few limitations on what operations are allowed in
+the UDF. See the docs on [UDFs](guide-to-udfs) for details.
diff --git a/docs/cudf/source/user_guide/pandas-comparison.rst b/docs/cudf/source/user_guide/pandas-comparison.rst
deleted file mode 100644
index 23cc66ce7e6..00000000000
--- a/docs/cudf/source/user_guide/pandas-comparison.rst
+++ /dev/null
@@ -1,164 +0,0 @@
-Comparison of cuDF and Pandas
-=============================
-
-cuDF is a DataFrame library that closely matches the Pandas API, but
-leverages NVIDIA GPUs for performing computations for speed. However,
-there are some differences between cuDF and Pandas, both in terms API
-and behavior. This page documents the similarities and differences
-between cuDF and Pandas.
-
-Supported operations
---------------------
-
-cuDF supports many of the same data structures and operations as
-Pandas. This includes ``Series``, ``DataFrame``, ``Index`` and
-operations on them such as unary and binary operations, indexing,
-filtering, concatenating, joining, groupby and window operations -
-among many others.
-
-The best way to see if we support a particular Pandas API is to search
-our `API docs `_.
-
-Data types
-----------
-
-cuDF supports many common data types supported by Pandas, including
-numeric, datetime, timestamp, string, and categorical data types. In
-addition, we support special data types for decimal, list and "struct"
-values. See the section on :doc:`Data Types ` for
-details.
-
-Note that we do not support custom data types like Pandas'
-``ExtensionDtype``.
-
-Null (or "missing") values
---------------------------
-
-Unlike Pandas, *all* data types in cuDF are nullable,
-meaning they can contain missing values (represented by ``cudf.NA``).
-
-.. code:: python
-
-    >>> s = cudf.Series([1, 2, cudf.NA])
-    >>> s
-    >>> s
-    0       1
-    1       2
-    2    <NA>
-    dtype: int64
-
-Nulls are not coerced to ``nan`` in any situation;
-compare the behaviour of cuDF with Pandas below:
-
-.. code:: python
-
-    >>> s = cudf.Series([1, 2, cudf.NA], dtype="category")
-    >>> s
-    0       1
-    1       2
-    2    <NA>
-    dtype: category
-    Categories (2, int64): [1, 2]
-
-    >>> s = pd.Series([1, 2, pd.NA], dtype="category")
-    >>> s
-    0      1
-    1      2
-    2    NaN
-    dtype: category
-    Categories (2, int64): [1, 2]
-
-See our :doc:`docs on missing data`
-for details.
-
-Iteration
----------
-
-Iterating over a cuDF ``Series``, ``DataFrame`` or ``Index`` is not
-supported. This is because iterating over data that resides on the GPU
-will yield *extremely* poor performance, as GPUs are optimized for
-highly parallel operations rather than sequential operations.
-
-In the vast majority of cases, it is possible to avoid iteration and
-use an existing function or method to accomplish the same task. If you
-absolutely must iterate, copy the data from GPU to CPU by using
-``.to_arrow()`` or ``.to_pandas()``, then copy the result back to GPU
-using ``.from_arrow()`` or ``.from_pandas()``.
-
-Result ordering
----------------
-
-By default, ``join`` (or ``merge``) and ``groupby`` operations in cuDF
-do *not* guarantee output ordering by default.
-Compare the results obtained from Pandas and cuDF below:
-
-.. code:: python
-
-    >>> import cupy as cp
-    >>> df = cudf.DataFrame({'a': cp.random.randint(0, 1000, 1000), 'b': range(1000)})
-    >>> df.groupby("a").mean().head()
-             b
-    a
-    742  694.5
-    29   840.0
-    459  525.5
-    442  363.0
-    666    7.0
-    >>> df.to_pandas().groupby("a").mean().head()
-             b
-    a
-    2   643.75
-    6    48.00
-    7   631.00
-    9   906.00
-    10  640.00
-
-To match Pandas behavior, you must explicitly pass ``sort=True``:
-
-.. code:: python
-
-    >>> df.to_pandas().groupby("a", sort=True).mean().head()
-             b
-    a
-    2   643.75
-    6    48.00
-    7   631.00
-    9   906.00
-    10  640.00
-
-Column names
-------------
-
-Unlike Pandas, cuDF does not support duplicate column names.
-It is best to use strings for column names.
-
-No true ``"object"`` data type
-------------------------------
-
-In Pandas and NumPy, the ``"object"`` data type is used for
-collections of arbitrary Python objects. For example, in Pandas you
-can do the following:
-
-.. code:: python
-
-    >>> import pandas as pd
-    >>> s = pd.Series(["a", 1, [1, 2, 3]])
-    0            a
-    1            1
-    2    [1, 2, 3]
-    dtype: object
-
-For compatibilty with Pandas, cuDF reports the data type for strings
-as ``"object"``, but we do *not* support storing or operating on
-collections of arbitrary Python objects.
-
-``.apply()`` function limitations
----------------------------------
-
-The ``.apply()`` function in Pandas accecpts a user-defined function
-(UDF) that can include arbitrary operations that are applied to each
-value of a ``Series``, ``DataFrame``, or in the case of a groupby,
-each group. cuDF also supports ``apply()``, but it relies on Numba to
-JIT compile the UDF and execute it on the GPU. This can be extremely
-fast, but imposes a few limitations on what operations are allowed in
-the UDF. See our :doc:`UDF docs ` for details.
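
The "Result ordering" section of the new page says that cuDF `groupby` results are only ordered when `sort=True` is passed. The following is a minimal sketch of that remedy, not part of the patch itself. The helper name `sorted_group_means` and the sample data are hypothetical, and the cuDF branch is an assumption that only runs when `cudf` imports successfully (it needs an NVIDIA GPU); otherwise only the Pandas reference result is computed.

```python
import pandas as pd

try:
    import cudf  # optional: needs an NVIDIA GPU and a RAPIDS installation
except Exception:  # treat any import or driver failure as "cuDF unavailable"
    cudf = None


def sorted_group_means(df, key):
    """Return per-group means with the group keys in ascending order.

    Hypothetical helper for illustration; `df` may be a Pandas or a cuDF
    DataFrame, since both accept `sort=True` in `groupby`.
    """
    # `sort=True` is passed explicitly so the call does not depend on
    # either library's default.
    return df.groupby(key, sort=True).mean()


# Small made-up frame, used only for illustration.
pdf = pd.DataFrame({"a": [3, 1, 3, 2, 1], "b": [10.0, 20.0, 30.0, 40.0, 50.0]})
expected = sorted_group_means(pdf, "a")

if cudf is not None:
    gdf = cudf.from_pandas(pdf)
    result = sorted_group_means(gdf, "a").to_pandas()
    # With sorting requested, both libraries should list the keys identically.
    assert result.index.tolist() == expected.index.tolist()
```

Passing `sort=True` in both libraries keeps code that is shared between Pandas and cuDF independent of either default, at the cost of an extra sort on the GPU.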