diff --git a/docs/cudf/source/user_guide/10min.ipynb b/docs/cudf/source/user_guide/10min.ipynb index 080fce3c55c..b9278151e64 100644 --- a/docs/cudf/source/user_guide/10min.ipynb +++ b/docs/cudf/source/user_guide/10min.ipynb @@ -2,7 +2,6 @@ "cells": [ { "cell_type": "markdown", - "id": "e9357872", "metadata": {}, "source": [ "10 Minutes to cuDF and Dask-cuDF\n", @@ -27,7 +26,6 @@ { "cell_type": "code", "execution_count": 1, - "id": "92eed4cb", "metadata": {}, "outputs": [], "source": [ @@ -47,7 +45,6 @@ }, { "cell_type": "markdown", - "id": "ed6c6047", "metadata": {}, "source": [ "Object Creation\n", @@ -56,7 +53,6 @@ }, { "cell_type": "markdown", - "id": "aeedd961", "metadata": {}, "source": [ "Creating a `cudf.Series` and `dask_cudf.Series`." @@ -65,7 +61,6 @@ { "cell_type": "code", "execution_count": 2, - "id": "cf8b08e5", "metadata": {}, "outputs": [ { @@ -92,7 +87,6 @@ { "cell_type": "code", "execution_count": 3, - "id": "083a5898", "metadata": {}, "outputs": [ { @@ -118,7 +112,6 @@ }, { "cell_type": "markdown", - "id": "6346e1b1", "metadata": {}, "source": [ "Creating a `cudf.DataFrame` and a `dask_cudf.DataFrame` by specifying values for each column." @@ -127,7 +120,6 @@ { "cell_type": "code", "execution_count": 4, - "id": "83d1e7f5", "metadata": {}, "outputs": [ { @@ -321,7 +313,6 @@ { "cell_type": "code", "execution_count": 5, - "id": "71b61d62", "metadata": {}, "outputs": [ { @@ -511,7 +502,6 @@ }, { "cell_type": "markdown", - "id": "c7cb5abc", "metadata": {}, "source": [ "Creating a `cudf.DataFrame` from a pandas `Dataframe` and a `dask_cudf.Dataframe` from a `cudf.Dataframe`.\n", @@ -522,7 +512,6 @@ { "cell_type": "code", "execution_count": 6, - "id": "07a62244", "metadata": {}, "outputs": [ { @@ -597,7 +586,6 @@ { "cell_type": "code", "execution_count": 7, - "id": "f5cb0c65", "metadata": {}, "outputs": [ { @@ -670,7 +658,6 @@ }, { "cell_type": "markdown", - "id": "025eac40", "metadata": {}, "source": [ "Viewing Data\n", @@ -679,7 +666,6 @@ }, { "cell_type": "markdown", - "id": "47a567e8", "metadata": {}, "source": [ "Viewing the top rows of a GPU dataframe." @@ -688,7 +674,6 @@ { "cell_type": "code", "execution_count": 8, - "id": "ab8cbdb8", "metadata": {}, "outputs": [ { @@ -752,7 +737,6 @@ { "cell_type": "code", "execution_count": 9, - "id": "2e923d8a", "metadata": {}, "outputs": [ { @@ -815,7 +799,6 @@ }, { "cell_type": "markdown", - "id": "61257b4b", "metadata": {}, "source": [ "Sorting by values." @@ -824,7 +807,6 @@ { "cell_type": "code", "execution_count": 10, - "id": "512770f9", "metadata": {}, "outputs": [ { @@ -1014,7 +996,6 @@ { "cell_type": "code", "execution_count": 11, - "id": "1a13993f", "metadata": {}, "outputs": [ { @@ -1203,7 +1184,6 @@ }, { "cell_type": "markdown", - "id": "19bce4c4", "metadata": {}, "source": [ "Selection\n", @@ -1214,7 +1194,6 @@ }, { "cell_type": "markdown", - "id": "ba55980e", "metadata": {}, "source": [ "Selecting a single column, which initially yields a `cudf.Series` or `dask_cudf.Series`. Calling `compute` results in a `cudf.Series` (equivalent to `df.a`)." 
@@ -1223,7 +1202,6 @@ { "cell_type": "code", "execution_count": 12, - "id": "885989a6", "metadata": {}, "outputs": [ { @@ -1264,7 +1242,6 @@ { "cell_type": "code", "execution_count": 13, - "id": "14a74255", "metadata": {}, "outputs": [ { @@ -1304,7 +1281,6 @@ }, { "cell_type": "markdown", - "id": "498d79f2", "metadata": {}, "source": [ "## Selection by Label" @@ -1312,7 +1288,6 @@ }, { "cell_type": "markdown", - "id": "4b8b8e13", "metadata": {}, "source": [ "Selecting rows from index 2 to index 5 from columns 'a' and 'b'." @@ -1321,7 +1296,6 @@ { "cell_type": "code", "execution_count": 14, - "id": "d40bc19c", "metadata": {}, "outputs": [ { @@ -1394,7 +1368,6 @@ { "cell_type": "code", "execution_count": 15, - "id": "7688535b", "metadata": {}, "outputs": [ { @@ -1466,7 +1439,6 @@ }, { "cell_type": "markdown", - "id": "8a64ce7a", "metadata": {}, "source": [ "## Selection by Position" @@ -1474,7 +1446,6 @@ }, { "cell_type": "markdown", - "id": "dfba2bb2", "metadata": {}, "source": [ "Selecting via integers and integer slices, like numpy/pandas. Note that this functionality is not available for Dask-cuDF DataFrames." @@ -1483,7 +1454,6 @@ { "cell_type": "code", "execution_count": 16, - "id": "fb8d6d43", "metadata": {}, "outputs": [ { @@ -1507,7 +1477,6 @@ { "cell_type": "code", "execution_count": 17, - "id": "263231da", "metadata": {}, "outputs": [ { @@ -1573,7 +1542,6 @@ }, { "cell_type": "markdown", - "id": "2223b089", "metadata": {}, "source": [ "You can also select elements of a `DataFrame` or `Series` with direct index access." @@ -1582,7 +1550,6 @@ { "cell_type": "code", "execution_count": 18, - "id": "13f6158b", "metadata": {}, "outputs": [ { @@ -1646,7 +1613,6 @@ { "cell_type": "code", "execution_count": 19, - "id": "3cf4aa26", "metadata": {}, "outputs": [ { @@ -1668,7 +1634,6 @@ }, { "cell_type": "markdown", - "id": "ff633b2d", "metadata": {}, "source": [ "## Boolean Indexing" @@ -1676,7 +1641,6 @@ }, { "cell_type": "markdown", - "id": "bbdef48f", "metadata": {}, "source": [ "Selecting rows in a `DataFrame` or `Series` by direct Boolean indexing." @@ -1685,7 +1649,6 @@ { "cell_type": "code", "execution_count": 20, - "id": "becb916f", "metadata": {}, "outputs": [ { @@ -1763,7 +1726,6 @@ { "cell_type": "code", "execution_count": 21, - "id": "b9475c43", "metadata": {}, "outputs": [ { @@ -1840,7 +1802,6 @@ }, { "cell_type": "markdown", - "id": "ecf982f5", "metadata": {}, "source": [ "Selecting values from a `DataFrame` where a Boolean condition is met, via the `query` API." @@ -1849,7 +1810,6 @@ { "cell_type": "code", "execution_count": 22, - "id": "fc2fc9f9", "metadata": {}, "outputs": [ { @@ -1906,7 +1866,6 @@ { "cell_type": "code", "execution_count": 23, - "id": "1a05a07f", "metadata": {}, "outputs": [ { @@ -1962,7 +1921,6 @@ }, { "cell_type": "markdown", - "id": "7f8955a0", "metadata": {}, "source": [ "You can also pass local variables to Dask-cuDF queries, via the `local_dict` keyword. With standard cuDF, you may either use the `local_dict` keyword or directly pass the variable via the `@` keyword. Supported logical operators include `>`, `<`, `>=`, `<=`, `==`, and `!=`." @@ -1971,7 +1929,6 @@ { "cell_type": "code", "execution_count": 24, - "id": "49485a4b", "metadata": {}, "outputs": [ { @@ -2029,7 +1986,6 @@ { "cell_type": "code", "execution_count": 25, - "id": "0f3a9116", "metadata": {}, "outputs": [ { @@ -2086,7 +2042,6 @@ }, { "cell_type": "markdown", - "id": "c355af07", "metadata": {}, "source": [ "Using the `isin` method for filtering." 
@@ -2095,7 +2050,6 @@ { "cell_type": "code", "execution_count": 26, - "id": "f44a5a57", "metadata": {}, "outputs": [ { @@ -2158,7 +2112,6 @@ }, { "cell_type": "markdown", - "id": "79a50beb", "metadata": {}, "source": [ "## MultiIndex" @@ -2166,7 +2119,6 @@ }, { "cell_type": "markdown", - "id": "14e70234", "metadata": {}, "source": [ "cuDF supports hierarchical indexing of DataFrames using MultiIndex. Grouping hierarchically (see `Grouping` below) automatically produces a DataFrame with a MultiIndex." @@ -2175,7 +2127,6 @@ { "cell_type": "code", "execution_count": 27, - "id": "882973ed", "metadata": {}, "outputs": [ { @@ -2202,7 +2153,6 @@ }, { "cell_type": "markdown", - "id": "c10971cc", "metadata": {}, "source": [ "This index can back either axis of a DataFrame." @@ -2211,7 +2161,6 @@ { "cell_type": "code", "execution_count": 28, - "id": "5417aeb9", "metadata": {}, "outputs": [ { @@ -2289,7 +2238,6 @@ { "cell_type": "code", "execution_count": 29, - "id": "4d6fb4ff", "metadata": {}, "outputs": [ { @@ -2363,7 +2311,6 @@ }, { "cell_type": "markdown", - "id": "63dc11d8", "metadata": {}, "source": [ "Accessing values of a DataFrame with a MultiIndex. Note that slicing is not yet supported." @@ -2372,7 +2319,6 @@ { "cell_type": "code", "execution_count": 30, - "id": "3644920c", "metadata": {}, "outputs": [ { @@ -2394,7 +2340,6 @@ }, { "cell_type": "markdown", - "id": "697a9a36", "metadata": {}, "source": [ "Missing Data\n", @@ -2403,7 +2348,6 @@ }, { "cell_type": "markdown", - "id": "86655274", "metadata": {}, "source": [ "Missing data can be replaced by using the `fillna` method." @@ -2412,7 +2356,6 @@ { "cell_type": "code", "execution_count": 31, - "id": "28b06c52", "metadata": {}, "outputs": [ { @@ -2438,7 +2381,6 @@ { "cell_type": "code", "execution_count": 32, - "id": "7fb6a126", "metadata": {}, "outputs": [ { @@ -2463,7 +2405,6 @@ }, { "cell_type": "markdown", - "id": "7a0b732f", "metadata": {}, "source": [ "Operations\n", @@ -2472,7 +2413,6 @@ }, { "cell_type": "markdown", - "id": "1e8b0464", "metadata": {}, "source": [ "## Stats" @@ -2480,7 +2420,6 @@ }, { "cell_type": "markdown", - "id": "7523512b", "metadata": {}, "source": [ "Calculating descriptive statistics for a `Series`." @@ -2489,7 +2428,6 @@ { "cell_type": "code", "execution_count": 33, - "id": "f7cb604e", "metadata": {}, "outputs": [ { @@ -2510,7 +2448,6 @@ { "cell_type": "code", "execution_count": 34, - "id": "b8957a5f", "metadata": {}, "outputs": [ { @@ -2530,7 +2467,6 @@ }, { "cell_type": "markdown", - "id": "71fa928a", "metadata": {}, "source": [ "## Applymap" @@ -2538,7 +2474,6 @@ }, { "cell_type": "markdown", - "id": "d98d6f7b", "metadata": {}, "source": [ "Applying functions to a `Series`. Note that applying user defined functions directly with Dask-cuDF is not yet implemented. For now, you can use [map_partitions](http://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.map_partitions.html) to apply a function to each partition of the distributed dataframe." @@ -2547,17 +2482,8 @@ { "cell_type": "code", "execution_count": 35, - "id": "5e627811", "metadata": {}, "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home/mmccarty/miniconda3/envs/cudf_dev/lib/python3.8/site-packages/cudf/core/series.py:2223: FutureWarning: Series.applymap is deprecated and will be removed in a future cuDF release. 
Use Series.apply instead.\n", - " warnings.warn(\n" - ] - }, { "data": { "text/plain": [ @@ -2593,13 +2519,12 @@ "def add_ten(num):\n", " return num + 10\n", "\n", - "df['a'].applymap(add_ten)" + "df['a'].apply(add_ten)" ] }, { "cell_type": "code", "execution_count": 36, - "id": "96cf628e", "metadata": {}, "outputs": [ { @@ -2639,7 +2564,6 @@ }, { "cell_type": "markdown", - "id": "cd69c00a", "metadata": {}, "source": [ "## Histogramming" @@ -2647,7 +2571,6 @@ }, { "cell_type": "markdown", - "id": "39982866", "metadata": {}, "source": [ "Counting the number of occurrences of each unique value of variable." @@ -2656,7 +2579,6 @@ { "cell_type": "code", "execution_count": 37, - "id": "62808675", "metadata": {}, "outputs": [ { @@ -2697,7 +2619,6 @@ { "cell_type": "code", "execution_count": 38, - "id": "5b2a42ce", "metadata": {}, "outputs": [ { @@ -2737,7 +2658,6 @@ }, { "cell_type": "markdown", - "id": "2d7e62e4", "metadata": {}, "source": [ "## String Methods" @@ -2745,7 +2665,6 @@ }, { "cell_type": "markdown", - "id": "4e704eca", "metadata": {}, "source": [ "Like pandas, cuDF provides string processing methods in the `str` attribute of `Series`. Full documentation of string methods is a work in progress. Please see the cuDF API documentation for more information." @@ -2754,7 +2673,6 @@ { "cell_type": "code", "execution_count": 39, - "id": "c73e70bb", "metadata": {}, "outputs": [ { @@ -2785,7 +2703,6 @@ { "cell_type": "code", "execution_count": 40, - "id": "697c1c94", "metadata": {}, "outputs": [ { @@ -2815,7 +2732,6 @@ }, { "cell_type": "markdown", - "id": "dfc1371e", "metadata": {}, "source": [ "## Concat" @@ -2823,7 +2739,6 @@ }, { "cell_type": "markdown", - "id": "f6fb9b53", "metadata": {}, "source": [ "Concatenating `Series` and `DataFrames` row-wise." @@ -2832,7 +2747,6 @@ { "cell_type": "code", "execution_count": 41, - "id": "60538bbd", "metadata": {}, "outputs": [ { @@ -2864,7 +2778,6 @@ { "cell_type": "code", "execution_count": 42, - "id": "17953847", "metadata": {}, "outputs": [ { @@ -2895,7 +2808,6 @@ }, { "cell_type": "markdown", - "id": "27f0d621", "metadata": {}, "source": [ "## Join" @@ -2903,7 +2815,6 @@ }, { "cell_type": "markdown", - "id": "fd35f1a7", "metadata": {}, "source": [ "Performing SQL style merges. Note that the dataframe order is not maintained, but may be restored post-merge by sorting by the index." @@ -2912,7 +2823,6 @@ { "cell_type": "code", "execution_count": 43, - "id": "52ada00a", "metadata": {}, "outputs": [ { @@ -3006,7 +2916,6 @@ { "cell_type": "code", "execution_count": 44, - "id": "409fcf92", "metadata": {}, "outputs": [ { @@ -3094,93 +3003,6 @@ }, { "cell_type": "markdown", - "id": "d9dcb86b", - "metadata": {}, - "source": [ - "## Append" - ] - }, - { - "cell_type": "markdown", - "id": "1f896819", - "metadata": {}, - "source": [ - "Appending values from another `Series` or array-like object." - ] - }, - { - "cell_type": "code", - "execution_count": 45, - "id": "9976c1ce", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home/mmccarty/miniconda3/envs/cudf_dev/lib/python3.8/site-packages/cudf/core/indexed_frame.py:2329: FutureWarning: append is deprecated and will be removed in a future version. 
Use concat instead.\n", - " warnings.warn(\n" - ] - }, - { - "data": { - "text/plain": [ - "0 1\n", - "1 2\n", - "2 3\n", - "3 \n", - "4 5\n", - "0 1\n", - "1 2\n", - "2 3\n", - "3 \n", - "4 5\n", - "dtype: int64" - ] - }, - "execution_count": 45, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "s.append(s)" - ] - }, - { - "cell_type": "code", - "execution_count": 46, - "id": "fe5c54ab", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 1\n", - "1 2\n", - "2 3\n", - "3 \n", - "4 5\n", - "0 1\n", - "1 2\n", - "2 3\n", - "3 \n", - "4 5\n", - "dtype: int64" - ] - }, - "execution_count": 46, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "ds2.append(ds2).compute()" - ] - }, - { - "cell_type": "markdown", - "id": "9fa10ef3", "metadata": {}, "source": [ "## Grouping" @@ -3188,7 +3010,6 @@ }, { "cell_type": "markdown", - "id": "8a6e41f5", "metadata": {}, "source": [ "Like pandas, cuDF and Dask-cuDF support the Split-Apply-Combine groupby paradigm." @@ -3196,8 +3017,7 @@ }, { "cell_type": "code", - "execution_count": 47, - "id": "2a8cafa7", + "execution_count": 45, "metadata": {}, "outputs": [], "source": [ @@ -3209,7 +3029,6 @@ }, { "cell_type": "markdown", - "id": "0179d60c", "metadata": {}, "source": [ "Grouping and then applying the `sum` function to the grouped data." @@ -3217,8 +3036,7 @@ }, { "cell_type": "code", - "execution_count": 48, - "id": "7c56d186", + "execution_count": 46, "metadata": {}, "outputs": [ { @@ -3281,7 +3099,7 @@ "0 100 90 100 3" ] }, - "execution_count": 48, + "execution_count": 46, "metadata": {}, "output_type": "execute_result" } @@ -3292,8 +3110,7 @@ }, { "cell_type": "code", - "execution_count": 49, - "id": "f8823b30", + "execution_count": 47, "metadata": {}, "outputs": [ { @@ -3356,7 +3173,7 @@ "0 100 90 100 3" ] }, - "execution_count": 49, + "execution_count": 47, "metadata": {}, "output_type": "execute_result" } @@ -3367,7 +3184,6 @@ }, { "cell_type": "markdown", - "id": "a84cb883", "metadata": {}, "source": [ "Grouping hierarchically then applying the `sum` function to grouped data." @@ -3375,8 +3191,7 @@ }, { "cell_type": "code", - "execution_count": 50, - "id": "2184e3ad", + "execution_count": 48, "metadata": {}, "outputs": [ { @@ -3455,7 +3270,7 @@ "0 1 27 30 27" ] }, - "execution_count": 50, + "execution_count": 48, "metadata": {}, "output_type": "execute_result" } @@ -3466,8 +3281,7 @@ }, { "cell_type": "code", - "execution_count": 51, - "id": "4ec311c1", + "execution_count": 49, "metadata": {}, "outputs": [ { @@ -3546,7 +3360,7 @@ "0 1 27 30 27" ] }, - "execution_count": 51, + "execution_count": 49, "metadata": {}, "output_type": "execute_result" } @@ -3557,7 +3371,6 @@ }, { "cell_type": "markdown", - "id": "dedfeb1b", "metadata": {}, "source": [ "Grouping and applying statistical functions to specific columns, using `agg`." 
@@ -3565,8 +3378,7 @@ }, { "cell_type": "code", - "execution_count": 52, - "id": "2563d8b2", + "execution_count": 50, "metadata": {}, "outputs": [ { @@ -3625,7 +3437,7 @@ "0 19 9.0 100" ] }, - "execution_count": 52, + "execution_count": 50, "metadata": {}, "output_type": "execute_result" } @@ -3636,8 +3448,7 @@ }, { "cell_type": "code", - "execution_count": 53, - "id": "22c77e75", + "execution_count": 51, "metadata": {}, "outputs": [ { @@ -3696,7 +3507,7 @@ "0 19 9.0 100" ] }, - "execution_count": 53, + "execution_count": 51, "metadata": {}, "output_type": "execute_result" } @@ -3707,7 +3518,6 @@ }, { "cell_type": "markdown", - "id": "6d074822", "metadata": {}, "source": [ "## Transpose" @@ -3715,7 +3525,6 @@ }, { "cell_type": "markdown", - "id": "16c0f0a8", "metadata": {}, "source": [ "Transposing a dataframe, using either the `transpose` method or `T` property. Currently, all columns must have the same type. Transposing is not currently implemented in Dask-cuDF." @@ -3723,8 +3532,7 @@ }, { "cell_type": "code", - "execution_count": 54, - "id": "e265861e", + "execution_count": 52, "metadata": {}, "outputs": [ { @@ -3779,7 +3587,7 @@ "2 3 6" ] }, - "execution_count": 54, + "execution_count": 52, "metadata": {}, "output_type": "execute_result" } @@ -3791,8 +3599,7 @@ }, { "cell_type": "code", - "execution_count": 55, - "id": "1fe9b972", + "execution_count": 53, "metadata": {}, "outputs": [ { @@ -3844,7 +3651,7 @@ "b 4 5 6" ] }, - "execution_count": 55, + "execution_count": 53, "metadata": {}, "output_type": "execute_result" } @@ -3855,7 +3662,6 @@ }, { "cell_type": "markdown", - "id": "9ce02827", "metadata": {}, "source": [ "Time Series\n", @@ -3864,7 +3670,6 @@ }, { "cell_type": "markdown", - "id": "fec907ff", "metadata": {}, "source": [ "`DataFrames` supports `datetime` typed columns, which allow users to interact with and filter data based on specific timestamps." @@ -3872,8 +3677,7 @@ }, { "cell_type": "code", - "execution_count": 56, - "id": "7a425d3f", + "execution_count": 54, "metadata": {}, "outputs": [ { @@ -3934,7 +3738,7 @@ "3 2018-11-23 0.103839" ] }, - "execution_count": 56, + "execution_count": 54, "metadata": {}, "output_type": "execute_result" } @@ -3952,8 +3756,7 @@ }, { "cell_type": "code", - "execution_count": 57, - "id": "87f0e56e", + "execution_count": 55, "metadata": {}, "outputs": [ { @@ -4014,7 +3817,7 @@ "3 2018-11-23 0.103839" ] }, - "execution_count": 57, + "execution_count": 55, "metadata": {}, "output_type": "execute_result" } @@ -4026,7 +3829,6 @@ }, { "cell_type": "markdown", - "id": "0d0e541c", "metadata": {}, "source": [ "Categoricals\n", @@ -4035,7 +3837,6 @@ }, { "cell_type": "markdown", - "id": "a36f9543", "metadata": {}, "source": [ "`DataFrames` support categorical columns." @@ -4043,8 +3844,7 @@ }, { "cell_type": "code", - "execution_count": 58, - "id": "05bd8be8", + "execution_count": 56, "metadata": {}, "outputs": [ { @@ -4117,7 +3917,7 @@ "5 6 e" ] }, - "execution_count": 58, + "execution_count": 56, "metadata": {}, "output_type": "execute_result" } @@ -4130,8 +3930,7 @@ }, { "cell_type": "code", - "execution_count": 59, - "id": "676b4963", + "execution_count": 57, "metadata": {}, "outputs": [ { @@ -4204,7 +4003,7 @@ "5 6 e" ] }, - "execution_count": 59, + "execution_count": 57, "metadata": {}, "output_type": "execute_result" } @@ -4216,7 +4015,6 @@ }, { "cell_type": "markdown", - "id": "e24f2e7b", "metadata": {}, "source": [ "Accessing the categories of a column. Note that this is currently not supported in Dask-cuDF." 
@@ -4224,8 +4022,7 @@ }, { "cell_type": "code", - "execution_count": 60, - "id": "06310c36", + "execution_count": 58, "metadata": {}, "outputs": [ { @@ -4234,7 +4031,7 @@ "StringIndex(['a' 'b' 'e'], dtype='object')" ] }, - "execution_count": 60, + "execution_count": 58, "metadata": {}, "output_type": "execute_result" } @@ -4245,7 +4042,6 @@ }, { "cell_type": "markdown", - "id": "4eb6f858", "metadata": {}, "source": [ "Accessing the underlying code values of each categorical observation." @@ -4253,8 +4049,7 @@ }, { "cell_type": "code", - "execution_count": 61, - "id": "0f6db260", + "execution_count": 59, "metadata": {}, "outputs": [ { @@ -4269,7 +4064,7 @@ "dtype: uint8" ] }, - "execution_count": 61, + "execution_count": 59, "metadata": {}, "output_type": "execute_result" } @@ -4280,8 +4075,7 @@ }, { "cell_type": "code", - "execution_count": 62, - "id": "b87c4375", + "execution_count": 60, "metadata": {}, "outputs": [ { @@ -4296,7 +4090,7 @@ "dtype: uint8" ] }, - "execution_count": 62, + "execution_count": 60, "metadata": {}, "output_type": "execute_result" } @@ -4307,7 +4101,6 @@ }, { "cell_type": "markdown", - "id": "3f816916", "metadata": {}, "source": [ "Converting Data Representation\n", @@ -4316,7 +4109,6 @@ }, { "cell_type": "markdown", - "id": "64a17f6d", "metadata": {}, "source": [ "## Pandas" @@ -4324,7 +4116,6 @@ }, { "cell_type": "markdown", - "id": "3acdcacc", "metadata": {}, "source": [ "Converting a cuDF and Dask-cuDF `DataFrame` to a pandas `DataFrame`." @@ -4332,8 +4123,7 @@ }, { "cell_type": "code", - "execution_count": 63, - "id": "d1fed919", + "execution_count": 61, "metadata": {}, "outputs": [ { @@ -4418,7 +4208,7 @@ "4 4 15 4 1 0" ] }, - "execution_count": 63, + "execution_count": 61, "metadata": {}, "output_type": "execute_result" } @@ -4429,8 +4219,7 @@ }, { "cell_type": "code", - "execution_count": 64, - "id": "567c7363", + "execution_count": 62, "metadata": {}, "outputs": [ { @@ -4515,7 +4304,7 @@ "4 4 15 4 1 0" ] }, - "execution_count": 64, + "execution_count": 62, "metadata": {}, "output_type": "execute_result" } @@ -4526,7 +4315,6 @@ }, { "cell_type": "markdown", - "id": "c2121453", "metadata": {}, "source": [ "## Numpy" @@ -4534,7 +4322,6 @@ }, { "cell_type": "markdown", - "id": "a9faa2c5", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `DataFrame` to a numpy `ndarray`." @@ -4542,8 +4329,7 @@ }, { "cell_type": "code", - "execution_count": 65, - "id": "5490d226", + "execution_count": 63, "metadata": {}, "outputs": [ { @@ -4571,7 +4357,7 @@ " [19, 0, 19, 0, 0]])" ] }, - "execution_count": 65, + "execution_count": 63, "metadata": {}, "output_type": "execute_result" } @@ -4582,8 +4368,7 @@ }, { "cell_type": "code", - "execution_count": 66, - "id": "b77ac8ae", + "execution_count": 64, "metadata": {}, "outputs": [ { @@ -4611,7 +4396,7 @@ " [19, 0, 19, 0, 0]])" ] }, - "execution_count": 66, + "execution_count": 64, "metadata": {}, "output_type": "execute_result" } @@ -4622,7 +4407,6 @@ }, { "cell_type": "markdown", - "id": "1d24d30f", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `Series` to a numpy `ndarray`." 
@@ -4630,8 +4414,7 @@ }, { "cell_type": "code", - "execution_count": 67, - "id": "f71a0ba3", + "execution_count": 65, "metadata": {}, "outputs": [ { @@ -4641,7 +4424,7 @@ " 17, 18, 19])" ] }, - "execution_count": 67, + "execution_count": 65, "metadata": {}, "output_type": "execute_result" } @@ -4652,8 +4435,7 @@ }, { "cell_type": "code", - "execution_count": 68, - "id": "a45a74b5", + "execution_count": 66, "metadata": {}, "outputs": [ { @@ -4663,7 +4445,7 @@ " 17, 18, 19])" ] }, - "execution_count": 68, + "execution_count": 66, "metadata": {}, "output_type": "execute_result" } @@ -4674,7 +4456,6 @@ }, { "cell_type": "markdown", - "id": "0d78a4d2", "metadata": {}, "source": [ "## Arrow" @@ -4682,7 +4463,6 @@ }, { "cell_type": "markdown", - "id": "7e35b829", "metadata": {}, "source": [ "Converting a cuDF or Dask-cuDF `DataFrame` to a PyArrow `Table`." @@ -4690,8 +4470,7 @@ }, { "cell_type": "code", - "execution_count": 69, - "id": "bb9e9a2a", + "execution_count": 67, "metadata": {}, "outputs": [ { @@ -4711,7 +4490,7 @@ "agg_col2: [[1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0]]" ] }, - "execution_count": 69, + "execution_count": 67, "metadata": {}, "output_type": "execute_result" } @@ -4722,8 +4501,7 @@ }, { "cell_type": "code", - "execution_count": 70, - "id": "4d020de7", + "execution_count": 68, "metadata": {}, "outputs": [ { @@ -4743,7 +4521,7 @@ "agg_col2: [[1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0]]" ] }, - "execution_count": 70, + "execution_count": 68, "metadata": {}, "output_type": "execute_result" } @@ -4754,7 +4532,6 @@ }, { "cell_type": "markdown", - "id": "ace7b4f9", "metadata": {}, "source": [ "Getting Data In/Out\n", @@ -4763,7 +4540,6 @@ }, { "cell_type": "markdown", - "id": "161abb12", "metadata": {}, "source": [ "## CSV" @@ -4771,7 +4547,6 @@ }, { "cell_type": "markdown", - "id": "7e5dc381", "metadata": {}, "source": [ "Writing to a CSV file." @@ -4779,8 +4554,7 @@ }, { "cell_type": "code", - "execution_count": 71, - "id": "3a59715f", + "execution_count": 69, "metadata": {}, "outputs": [], "source": [ @@ -4792,8 +4566,7 @@ }, { "cell_type": "code", - "execution_count": 72, - "id": "4ebe98ed", + "execution_count": 70, "metadata": {}, "outputs": [], "source": [ @@ -4802,7 +4575,6 @@ }, { "cell_type": "markdown", - "id": "0479fc4f", "metadata": {}, "source": [ "Reading from a csv file." @@ -4810,8 +4582,7 @@ }, { "cell_type": "code", - "execution_count": 73, - "id": "1a70e831", + "execution_count": 71, "metadata": {}, "outputs": [ { @@ -5031,7 +4802,7 @@ "19 19 0 19 0 0" ] }, - "execution_count": 73, + "execution_count": 71, "metadata": {}, "output_type": "execute_result" } @@ -5043,8 +4814,7 @@ }, { "cell_type": "code", - "execution_count": 74, - "id": "4c3d9ca3", + "execution_count": 72, "metadata": {}, "outputs": [ { @@ -5264,7 +5034,7 @@ "19 19 0 19 0 0" ] }, - "execution_count": 74, + "execution_count": 72, "metadata": {}, "output_type": "execute_result" } @@ -5276,7 +5046,6 @@ }, { "cell_type": "markdown", - "id": "3d739c6e", "metadata": {}, "source": [ "Reading all CSV files in a directory into a single `dask_cudf.DataFrame`, using the star wildcard." 
@@ -5284,8 +5053,7 @@ }, { "cell_type": "code", - "execution_count": 75, - "id": "cb7187d2", + "execution_count": 73, "metadata": {}, "outputs": [ { @@ -5685,7 +5453,7 @@ "19 19 0 19 0 0" ] }, - "execution_count": 75, + "execution_count": 73, "metadata": {}, "output_type": "execute_result" } @@ -5697,7 +5465,6 @@ }, { "cell_type": "markdown", - "id": "c0939a1e", "metadata": {}, "source": [ "## Parquet" @@ -5705,7 +5472,6 @@ }, { "cell_type": "markdown", - "id": "14e6a634", "metadata": {}, "source": [ "Writing to parquet files, using the CPU via PyArrow." @@ -5713,8 +5479,7 @@ }, { "cell_type": "code", - "execution_count": 76, - "id": "1812346f", + "execution_count": 74, "metadata": {}, "outputs": [], "source": [ @@ -5723,7 +5488,6 @@ }, { "cell_type": "markdown", - "id": "093cd0fe", "metadata": {}, "source": [ "Reading parquet files with a GPU-accelerated parquet reader." @@ -5731,8 +5495,7 @@ }, { "cell_type": "code", - "execution_count": 77, - "id": "2354b20b", + "execution_count": 75, "metadata": {}, "outputs": [ { @@ -5952,7 +5715,7 @@ "19 19 0 19 0 0" ] }, - "execution_count": 77, + "execution_count": 75, "metadata": {}, "output_type": "execute_result" } @@ -5964,7 +5727,6 @@ }, { "cell_type": "markdown", - "id": "132c3ff2", "metadata": {}, "source": [ "Writing to parquet files from a `dask_cudf.DataFrame` using PyArrow under the hood." @@ -5972,8 +5734,7 @@ }, { "cell_type": "code", - "execution_count": 78, - "id": "c5d7686c", + "execution_count": 76, "metadata": {}, "outputs": [ { @@ -5982,7 +5743,7 @@ "(None,)" ] }, - "execution_count": 78, + "execution_count": 76, "metadata": {}, "output_type": "execute_result" } @@ -5993,7 +5754,6 @@ }, { "cell_type": "markdown", - "id": "0d73d1dd", "metadata": {}, "source": [ "## ORC" @@ -6001,7 +5761,6 @@ }, { "cell_type": "markdown", - "id": "61b5f466", "metadata": {}, "source": [ "Reading ORC files." @@ -6009,34 +5768,19 @@ }, { "cell_type": "code", - "execution_count": 79, - "id": "93364ff3", + "execution_count": 77, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'/home/ashwin/workspace/rapids/cudf/python/cudf/cudf/tests/data/orc/TestOrcFile.test1.orc'" - ] - }, - "execution_count": 79, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "import os\n", "from pathlib import Path\n", - "current_dir = os.path.dirname(os.path.realpath(\"__file__\"))\n", - "cudf_root = Path(current_dir).parents[3]\n", - "file_path = os.path.join(cudf_root, \"python\", \"cudf\", \"cudf\", \"tests\", \"data\", \"orc\", \"TestOrcFile.test1.orc\")\n", - "file_path" + "cudf_root = Path(\".\").absolute().parents[3]\n", + "orc_file = Path(\"python/cudf/cudf/tests/data/orc/TestOrcFile.test1.orc\")\n", + "file_path = cudf_root / orc_file" ] }, { "cell_type": "code", - "execution_count": 80, - "id": "2b6785c7", + "execution_count": 78, "metadata": {}, "outputs": [ { @@ -6127,7 +5871,7 @@ "1 [{'key': 'chani', 'value': {'int1': 5, 'string... " ] }, - "execution_count": 80, + "execution_count": 78, "metadata": {}, "output_type": "execute_result" } @@ -6139,7 +5883,6 @@ }, { "cell_type": "markdown", - "id": "238ce6a4", "metadata": {}, "source": [ "Dask Performance Tips\n", @@ -6154,7 +5897,6 @@ }, { "cell_type": "markdown", - "id": "3de9aeca", "metadata": {}, "source": [ "First, we set up a GPU cluster. With our `client` set up, Dask-cuDF computation will be distributed across the GPUs in the cluster." 
@@ -6162,255 +5904,15 @@ }, { "cell_type": "code", - "execution_count": 81, - "id": "e4852d48", + "execution_count": 79, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "2022-04-21 13:26:06,860 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n", - "2022-04-21 13:26:06,904 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n" + "2022-05-12 22:41:08,024 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n" ] - }, - { - "data": { - "text/html": [ - "
\n", - "
\n", - "
\n", - "

Client

\n", - "

Client-20d00fd5-c198-11ec-906c-c8d9d2247354

\n", - " \n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "\n", - "
Connection method: Cluster objectCluster type: dask_cuda.LocalCUDACluster
\n", - " Dashboard: http://127.0.0.1:8787/status\n", - "
\n", - "\n", - " \n", - "
\n", - "

Cluster Info

\n", - "
\n", - "
\n", - "
\n", - "
\n", - "

LocalCUDACluster

\n", - "

47648c26

\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "\n", - "\n", - " \n", - "
\n", - " Dashboard: http://127.0.0.1:8787/status\n", - " \n", - " Workers: 2\n", - "
\n", - " Total threads: 2\n", - " \n", - " Total memory: 125.65 GiB\n", - "
Status: runningUsing processes: True
\n", - "\n", - "
\n", - " \n", - "

Scheduler Info

\n", - "
\n", - "\n", - "
\n", - "
\n", - "
\n", - "
\n", - "

Scheduler

\n", - "

Scheduler-f28bff16-cb70-452c-b8af-b9299a8d7b20

\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " Comm: tcp://127.0.0.1:33995\n", - " \n", - " Workers: 2\n", - "
\n", - " Dashboard: http://127.0.0.1:8787/status\n", - " \n", - " Total threads: 2\n", - "
\n", - " Started: Just now\n", - " \n", - " Total memory: 125.65 GiB\n", - "
\n", - "
\n", - "
\n", - "\n", - "
\n", - " \n", - "

Workers

\n", - "
\n", - "\n", - " \n", - "
\n", - "
\n", - "
\n", - "
\n", - " \n", - "

Worker: 0

\n", - "
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "\n", - " \n", - "\n", - "
\n", - " Comm: tcp://127.0.0.1:40479\n", - " \n", - " Total threads: 1\n", - "
\n", - " Dashboard: http://127.0.0.1:38985/status\n", - " \n", - " Memory: 62.82 GiB\n", - "
\n", - " Nanny: tcp://127.0.0.1:33447\n", - "
\n", - " Local directory: /home/ashwin/workspace/rapids/cudf/docs/cudf/source/user_guide/dask-worker-space/worker-be7zg92w\n", - "
\n", - " GPU: NVIDIA RTX A6000\n", - " \n", - " GPU memory: 47.51 GiB\n", - "
\n", - "
\n", - "
\n", - "
\n", - " \n", - "
\n", - "
\n", - "
\n", - "
\n", - " \n", - "

Worker: 1

\n", - "
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "\n", - " \n", - "\n", - "
\n", - " Comm: tcp://127.0.0.1:40519\n", - " \n", - " Total threads: 1\n", - "
\n", - " Dashboard: http://127.0.0.1:40951/status\n", - " \n", - " Memory: 62.82 GiB\n", - "
\n", - " Nanny: tcp://127.0.0.1:39133\n", - "
\n", - " Local directory: /home/ashwin/workspace/rapids/cudf/docs/cudf/source/user_guide/dask-worker-space/worker-3v0c20ux\n", - "
\n", - " GPU: NVIDIA RTX A6000\n", - " \n", - " GPU memory: 47.54 GiB\n", - "
\n", - "
\n", - "
\n", - "
\n", - " \n", - "\n", - "
\n", - "
\n", - "\n", - "
\n", - "
\n", - "
\n", - "
\n", - " \n", - "\n", - "
\n", - "
" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 81, - "metadata": {}, - "output_type": "execute_result" } ], "source": [ @@ -6420,13 +5922,11 @@ "from dask_cuda import LocalCUDACluster\n", "\n", "cluster = LocalCUDACluster()\n", - "client = Client(cluster)\n", - "client" + "client = Client(cluster)" ] }, { "cell_type": "markdown", - "id": "181e4d10", "metadata": {}, "source": [ "### Persisting Data\n", @@ -6435,8 +5935,7 @@ }, { "cell_type": "code", - "execution_count": 82, - "id": "d47a1142", + "execution_count": 80, "metadata": {}, "outputs": [ { @@ -6512,7 +6011,7 @@ "" ] }, - "execution_count": 82, + "execution_count": 80, "metadata": {}, "output_type": "execute_result" } @@ -6528,37 +6027,38 @@ }, { "cell_type": "code", - "execution_count": 83, - "id": "c3cb612a", + "execution_count": 81, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Thu Apr 21 13:26:07 2022 \r\n", - "+-----------------------------------------------------------------------------+\r\n", - "| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |\r\n", - "|-------------------------------+----------------------+----------------------+\r\n", - "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\r\n", - "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\r\n", - "| | | MIG M. |\r\n", - "|===============================+======================+======================|\r\n", - "| 0 Quadro GV100 Off | 00000000:15:00.0 Off | Off |\r\n", - "| 39% 52C P2 51W / 250W | 1115MiB / 32508MiB | 0% Default |\r\n", - "| | | N/A |\r\n", - "+-------------------------------+----------------------+----------------------+\r\n", - "| 1 Quadro GV100 Off | 00000000:2D:00.0 Off | Off |\r\n", - "| 43% 57C P2 52W / 250W | 306MiB / 32498MiB | 0% Default |\r\n", - "| | | N/A |\r\n", - "+-------------------------------+----------------------+----------------------+\r\n", - " \r\n", - "+-----------------------------------------------------------------------------+\r\n", - "| Processes: |\r\n", - "| GPU GI CI PID Type Process name GPU Memory |\r\n", - "| ID ID Usage |\r\n", - "|=============================================================================|\r\n", - "+-----------------------------------------------------------------------------+\r\n" + "Thu May 12 22:41:08 2022 \n", + "+-----------------------------------------------------------------------------+\n", + "| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |\n", + "|-------------------------------+----------------------+----------------------+\n", + "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", + "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", + "| | | MIG M. 
|\n", + "|===============================+======================+======================|\n", + "| 0 NVIDIA RTX A6000 On | 00000000:65:00.0 On | Off |\n", + "| 30% 41C P2 77W / 300W | 1380MiB / 49140MiB | 2% Default |\n", + "| | | N/A |\n", + "+-------------------------------+----------------------+----------------------+\n", + " \n", + "+-----------------------------------------------------------------------------+\n", + "| Processes: |\n", + "| GPU GI CI PID Type Process name GPU Memory |\n", + "| ID ID Usage |\n", + "|=============================================================================|\n", + "| 0 N/A N/A 1674 G 159MiB |\n", + "| 0 N/A N/A 1950 G 47MiB |\n", + "| 0 N/A N/A 13521 G 132MiB |\n", + "| 0 N/A N/A 304797 G 36MiB |\n", + "| 0 N/A N/A 488366 C 743MiB |\n", + "| 0 N/A N/A 488425 C 257MiB |\n", + "+-----------------------------------------------------------------------------+\n" ] } ], @@ -6568,7 +6068,6 @@ }, { "cell_type": "markdown", - "id": "b98810c4", "metadata": {}, "source": [ "Because Dask is lazy, the computation has not yet occurred. We can see that there are twenty tasks in the task graph and we've used about 800 MB of memory. We can force computation by using `persist`. By forcing execution, the result is now explicitly in memory and our task graph only contains one task per partition (the baseline)." @@ -6576,8 +6075,7 @@ }, { "cell_type": "code", - "execution_count": 84, - "id": "a929577c", + "execution_count": 82, "metadata": {}, "outputs": [ { @@ -6653,7 +6151,7 @@ "" ] }, - "execution_count": 84, + "execution_count": 82, "metadata": {}, "output_type": "execute_result" } @@ -6665,28 +6163,23 @@ }, { "cell_type": "code", - "execution_count": 85, - "id": "8aa7c079", + "execution_count": 83, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Thu Apr 21 13:26:08 2022 \r\n", + "Thu May 12 22:41:14 2022 \r\n", "+-----------------------------------------------------------------------------+\r\n", - "| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |\r\n", + "| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |\r\n", "|-------------------------------+----------------------+----------------------+\r\n", "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\r\n", "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\r\n", "| | | MIG M. 
|\r\n", "|===============================+======================+======================|\r\n", - "| 0 Quadro GV100 Off | 00000000:15:00.0 Off | Off |\r\n", - "| 39% 52C P2 52W / 250W | 1115MiB / 32508MiB | 3% Default |\r\n", - "| | | N/A |\r\n", - "+-------------------------------+----------------------+----------------------+\r\n", - "| 1 Quadro GV100 Off | 00000000:2D:00.0 Off | Off |\r\n", - "| 43% 57C P2 51W / 250W | 306MiB / 32498MiB | 0% Default |\r\n", + "| 0 NVIDIA RTX A6000 On | 00000000:65:00.0 On | Off |\r\n", + "| 30% 42C P2 77W / 300W | 1942MiB / 49140MiB | 0% Default |\r\n", "| | | N/A |\r\n", "+-------------------------------+----------------------+----------------------+\r\n", " \r\n", @@ -6695,17 +6188,23 @@ "| GPU GI CI PID Type Process name GPU Memory |\r\n", "| ID ID Usage |\r\n", "|=============================================================================|\r\n", + "| 0 N/A N/A 1674 G 159MiB |\r\n", + "| 0 N/A N/A 1950 G 47MiB |\r\n", + "| 0 N/A N/A 13521 G 132MiB |\r\n", + "| 0 N/A N/A 304797 G 36MiB |\r\n", + "| 0 N/A N/A 488366 C 743MiB |\r\n", + "| 0 N/A N/A 488425 C 819MiB |\r\n", "+-----------------------------------------------------------------------------+\r\n" ] } ], "source": [ - "!nvidia-smi" + "# Sleep to ensure the persist finishes and shows in the memory usage\n", + "!sleep 5; nvidia-smi" ] }, { "cell_type": "markdown", - "id": "ff9e14b6", "metadata": {}, "source": [ "Because we forced computation, we now have a larger object in distributed GPU memory." @@ -6713,7 +6212,6 @@ }, { "cell_type": "markdown", - "id": "bb3b3dee", "metadata": {}, "source": [ "### Wait\n", @@ -6724,8 +6222,7 @@ }, { "cell_type": "code", - "execution_count": 86, - "id": "ef71bf00", + "execution_count": 84, "metadata": {}, "outputs": [], "source": [ @@ -6737,22 +6234,20 @@ "ddf1 = dask_cudf.from_cudf(df1, npartitions=100)\n", "\n", "def func(df):\n", - " time.sleep(random.randint(1, 60))\n", + " time.sleep(random.randint(1, 10))\n", " return (df + 5) * 3 - 11" ] }, { "cell_type": "markdown", - "id": "e1099ec0", "metadata": {}, "source": [ - "This function will do a basic transformation of every column in the dataframe, but the time spent in the function will vary due to the `time.sleep` statement randomly adding 1-60 seconds of time. We'll run this on every partition of our dataframe using `map_partitions`, which adds the task to our task-graph, and store the result. We can then call `persist` to force execution." + "This function will do a basic transformation of every column in the dataframe, but the time spent in the function will vary due to the `time.sleep` statement randomly adding 1-10 seconds of time. We'll run this on every partition of our dataframe using `map_partitions`, which adds the task to our task-graph, and store the result. We can then call `persist` to force execution." ] }, { "cell_type": "code", - "execution_count": 87, - "id": "700dd799", + "execution_count": 85, "metadata": {}, "outputs": [], "source": [ @@ -6762,7 +6257,6 @@ }, { "cell_type": "markdown", - "id": "5eb83a7e", "metadata": {}, "source": [ "However, some partitions will be done **much** sooner than others. If we had downstream processes that should wait for all partitions to be completed, we can enforce that behavior using `wait`." 
@@ -6770,17 +6264,16 @@ }, { "cell_type": "code", - "execution_count": 88, - "id": "73bccf94", + "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "DoneAndNotDoneFutures(done={, , , , }, not_done=set())" + "DoneAndNotDoneFutures(done={, , , , }, not_done=set())" ] }, - "execution_count": 88, + "execution_count": 86, "metadata": {}, "output_type": "execute_result" } @@ -6791,22 +6284,14 @@ }, { "cell_type": "markdown", - "id": "447301f5", "metadata": {}, "source": [ "With `wait`, we can safely proceed on in our workflow." ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7e06fcf4", - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { + "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", @@ -6822,9 +6307,22 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.13" + "version": "3.9.12" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": false, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": false } }, "nbformat": 4, - "nbformat_minor": 5 + "nbformat_minor": 4 }