diff --git a/docs/cudf/source/user_guide/10min.ipynb b/docs/cudf/source/user_guide/10min.ipynb index 870e334c216..923a251514d 100644 --- a/docs/cudf/source/user_guide/10min.ipynb +++ b/docs/cudf/source/user_guide/10min.ipynb @@ -7,7 +7,7 @@ "10 Minutes to cuDF and Dask-cuDF\n", "=======================\n", "\n", - "Modeled after 10 Minutes to Pandas, this is a short introduction to cuDF and Dask-cuDF, geared mainly for new users.\n", + "Modelled after 10 Minutes to Pandas, this is a short introduction to cuDF and Dask-cuDF, geared mainly for new users.\n", "\n", "### What are these Libraries?\n", "\n", @@ -15,7 +15,7 @@ "\n", "[Dask](https://dask.org/) is a flexible library for parallel computing in Python that makes scaling out your workflow smooth and simple. On the CPU, Dask uses Pandas to execute operations in parallel on DataFrame partitions.\n", "\n", - "[Dask-cuDF](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf) extends Dask where necessary to allow its DataFrame partitions to be processed by cuDF GPU DataFrames as opposed to Pandas DataFrames. For instance, when you call dask_cudf.read_csv(...), your cluster's GPUs do the work of parsing the CSV file(s) with underlying cudf.read_csv().\n", + "[Dask-cuDF](https://github.com/rapidsai/cudf/tree/main/python/dask_cudf) extends Dask where necessary to allow its DataFrame partitions to be processed by cuDF GPU DataFrames as opposed to Pandas DataFrames. 
For instance, when you call `dask_cudf.read_csv(...)`, your cluster's GPUs do the work of parsing the CSV file(s) with underlying [`cudf.read_csv()`](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.read_csv.html).\n", "\n", "\n", "### When to use cuDF and Dask-cuDF\n", @@ -92,11 +92,9 @@ { "data": { "text/plain": [ - "0 1\n", - "1 2\n", - "2 3\n", - "3 \n", - "4 4\n", + "0 1\n", + "1 2\n", + "2 3\n", "dtype: int64" ] }, @@ -106,8 +104,10 @@ } ], "source": [ - "ds = dask_cudf.from_cudf(s, npartitions=2) \n", - "ds.compute()" + "ds = dask_cudf.from_cudf(s, npartitions=2)\n", + "# Note the call to head here to show the first few entries, see more details\n", + "# below.\n", + "ds.head(n=3)" ] }, { @@ -310,6 +310,29 @@ "df" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we will convert our cuDF dataframe into a dask-cuDF equivalent. Here we see\n", + "a first difference. To inspect the data we must call a method (here `.head()` to\n", + "look at the first few values). In the general case (see the end of this\n", + "notebook), the data in `ddf` will be distributed across multiple GPUs.\n", + "\n", + "In this small case, we could call `ddf.compute()` to obtain a cuDF object from\n", + "the dask-cuDF object. In general, we should avoid calling `.compute()` on large\n", + "dataframes, and restrict ourselves to using it when we have some (relatively)\n", + "small postprocessed result that we wish to inspect. 
Hence, throughout this\n", + "notebook we will generally call `.head()` to inspect the first few values of a\n", + "dask-cuDF dataframe, occasionally calling out places where we use `.compute()`\n", + "and why.\n", + "\n", + "*To understand more of the differences between how cuDF and dask-cuDF behave\n", + "here, visit the [10 Minutes to\n", + "Dask](https://docs.dask.org/en/stable/10-minutes-to-dask.html) tutorial after\n", + "this one.*" + ] + }, { "cell_type": "code", "execution_count": 5, @@ -372,122 +395,17 @@ " 15\n", " 4\n", " \n", - " \n", - " 5\n", - " 5\n", - " 14\n", - " 5\n", - " \n", - " \n", - " 6\n", - " 6\n", - " 13\n", - " 6\n", - " \n", - " \n", - " 7\n", - " 7\n", - " 12\n", - " 7\n", - " \n", - " \n", - " 8\n", - " 8\n", - " 11\n", - " 8\n", - " \n", - " \n", - " 9\n", - " 9\n", - " 10\n", - " 9\n", - " \n", - " \n", - " 10\n", - " 10\n", - " 9\n", - " 10\n", - " \n", - " \n", - " 11\n", - " 11\n", - " 8\n", - " 11\n", - " \n", - " \n", - " 12\n", - " 12\n", - " 7\n", - " 12\n", - " \n", - " \n", - " 13\n", - " 13\n", - " 6\n", - " 13\n", - " \n", - " \n", - " 14\n", - " 14\n", - " 5\n", - " 14\n", - " \n", - " \n", - " 15\n", - " 15\n", - " 4\n", - " 15\n", - " \n", - " \n", - " 16\n", - " 16\n", - " 3\n", - " 16\n", - " \n", - " \n", - " 17\n", - " 17\n", - " 2\n", - " 17\n", - " \n", - " \n", - " 18\n", - " 18\n", - " 1\n", - " 18\n", - " \n", - " \n", - " 19\n", - " 19\n", - " 0\n", - " 19\n", - " \n", " \n", "\n", "" ], "text/plain": [ - " a b c\n", - "0 0 19 0\n", - "1 1 18 1\n", - "2 2 17 2\n", - "3 3 16 3\n", - "4 4 15 4\n", - "5 5 14 5\n", - "6 6 13 6\n", - "7 7 12 7\n", - "8 8 11 8\n", - "9 9 10 9\n", - "10 10 9 10\n", - "11 11 8 11\n", - "12 12 7 12\n", - "13 13 6 13\n", - "14 14 5 14\n", - "15 15 4 15\n", - "16 16 3 16\n", - "17 17 2 17\n", - "18 18 1 18\n", - "19 19 0 19" + " a b c\n", + "0 0 19 0\n", + "1 1 18 1\n", + "2 2 17 2\n", + "3 3 16 3\n", + "4 4 15 4" ] }, "execution_count": 5, @@ -497,7 +415,7 @@ ], "source": [ "ddf = 
dask_cudf.from_cudf(df, npartitions=2) \n", - "ddf.compute()" + "ddf.head()" ] }, { @@ -624,26 +542,14 @@ " 1\n", " 0.2\n", " \n", - " \n", - " 2\n", - " 2\n", - " <NA>\n", - " \n", - " \n", - " 3\n", - " 3\n", - " 0.3\n", - " \n", " \n", "\n", "" ], "text/plain": [ - " a b\n", - "0 0 0.1\n", - "1 1 0.2\n", - "2 2 \n", - "3 3 0.3" + " a b\n", + "0 0 0.1\n", + "1 1 0.2" ] }, "execution_count": 7, @@ -653,7 +559,7 @@ ], "source": [ "dask_gdf = dask_cudf.from_cudf(gdf, npartitions=2)\n", - "dask_gdf.compute()" + "dask_gdf.head(n=2)" ] }, { @@ -1055,122 +961,17 @@ " 4\n", " 15\n", " \n", - " \n", - " 14\n", - " 14\n", - " 5\n", - " 14\n", - " \n", - " \n", - " 13\n", - " 13\n", - " 6\n", - " 13\n", - " \n", - " \n", - " 12\n", - " 12\n", - " 7\n", - " 12\n", - " \n", - " \n", - " 11\n", - " 11\n", - " 8\n", - " 11\n", - " \n", - " \n", - " 10\n", - " 10\n", - " 9\n", - " 10\n", - " \n", - " \n", - " 9\n", - " 9\n", - " 10\n", - " 9\n", - " \n", - " \n", - " 8\n", - " 8\n", - " 11\n", - " 8\n", - " \n", - " \n", - " 7\n", - " 7\n", - " 12\n", - " 7\n", - " \n", - " \n", - " 6\n", - " 6\n", - " 13\n", - " 6\n", - " \n", - " \n", - " 5\n", - " 5\n", - " 14\n", - " 5\n", - " \n", - " \n", - " 4\n", - " 4\n", - " 15\n", - " 4\n", - " \n", - " \n", - " 3\n", - " 3\n", - " 16\n", - " 3\n", - " \n", - " \n", - " 2\n", - " 2\n", - " 17\n", - " 2\n", - " \n", - " \n", - " 1\n", - " 1\n", - " 18\n", - " 1\n", - " \n", - " \n", - " 0\n", - " 0\n", - " 19\n", - " 0\n", - " \n", " \n", "\n", "" ], "text/plain": [ - " a b c\n", - "19 19 0 19\n", - "18 18 1 18\n", - "17 17 2 17\n", - "16 16 3 16\n", - "15 15 4 15\n", - "14 14 5 14\n", - "13 13 6 13\n", - "12 12 7 12\n", - "11 11 8 11\n", - "10 10 9 10\n", - "9 9 10 9\n", - "8 8 11 8\n", - "7 7 12 7\n", - "6 6 13 6\n", - "5 5 14 5\n", - "4 4 15 4\n", - "3 3 16 3\n", - "2 2 17 2\n", - "1 1 18 1\n", - "0 0 19 0" + " a b c\n", + "19 19 0 19\n", + "18 18 1 18\n", + "17 17 2 17\n", + "16 16 3 16\n", + "15 15 4 15" ] }, "execution_count": 
11, @@ -1179,7 +980,7 @@ } ], "source": [ - "ddf.sort_values(by='b').compute()" + "ddf.sort_values(by='b').head()" ] }, { @@ -1247,26 +1048,11 @@ { "data": { "text/plain": [ - "0 0\n", - "1 1\n", - "2 2\n", - "3 3\n", - "4 4\n", - "5 5\n", - "6 6\n", - "7 7\n", - "8 8\n", - "9 9\n", - "10 10\n", - "11 11\n", - "12 12\n", - "13 13\n", - "14 14\n", - "15 15\n", - "16 16\n", - "17 17\n", - "18 18\n", - "19 19\n", + "0 0\n", + "1 1\n", + "2 2\n", + "3 3\n", + "4 4\n", "Name: a, dtype: int64" ] }, @@ -1276,7 +1062,7 @@ } ], "source": [ - "ddf['a'].compute()" + "ddf['a'].head()" ] }, { @@ -1434,7 +1220,7 @@ } ], "source": [ - "ddf.loc[2:5, ['a', 'b']].compute()" + "ddf.loc[2:5, ['a', 'b']].head()" ] }, { @@ -1773,12 +1559,6 @@ " 17\n", " 2\n", " \n", - " \n", - " 3\n", - " 3\n", - " 16\n", - " 3\n", - " \n", " \n", "\n", "" @@ -1787,8 +1567,7 @@ " a b c\n", "0 0 19 0\n", "1 1 18 1\n", - "2 2 17 2\n", - "3 3 16 3" + "2 2 17 2" ] }, "execution_count": 21, @@ -1797,7 +1576,7 @@ } ], "source": [ - "ddf[ddf.b > 15].compute()" + "ddf[ddf.b > 15].head(n=3)" ] }, { @@ -1863,6 +1642,15 @@ "df.query(\"b == 3\")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note here we call `compute()` rather than `head()` on the dask-cuDF dataframe\n", + "since we are happy that the number of matching rows will be small (and hence it\n", + "is reasonable to bring the entire result back)." 
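The filtering pattern above can be sketched on CPU with pandas, whose API cuDF mirrors; the frame and column names here are illustrative stand-ins, not the notebook's `ddf`:

```python
import pandas as pd

# Illustrative stand-in for the notebook's dataframe (cuDF follows the
# pandas API, so the same calls work on a cudf.DataFrame).
df = pd.DataFrame({"a": range(20), "b": range(19, -1, -1)})

# Only a handful of rows match, so materializing the full result is cheap;
# on a dask-cuDF dataframe this is the situation where `.compute()` is
# reasonable instead of `.head()`.
small = df.query("b > 15")
print(small)
```

The same `query` call runs unchanged on a `cudf.DataFrame`; only the decision between `.head()` and `.compute()` is dask-cuDF specific.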
+ ] + }, { "cell_type": "code", "execution_count": 23, @@ -2240,6 +2028,18 @@ "execution_count": 29, "metadata": {}, "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/datasets/lmitchell/mambaforge/envs/cudf-dev/lib/python3.9/site-packages/cudf/core/column_accessor.py:251: FutureWarning: In a future version, the Index constructor will not infer numeric dtypes when passed object-dtype sequences (matching Series behavior)\n", + " result = pd.MultiIndex.from_frame(\n", + "/datasets/lmitchell/mambaforge/envs/cudf-dev/lib/python3.9/site-packages/cudf/core/column_accessor.py:251: FutureWarning: In a future version, the Index constructor will not infer numeric dtypes when passed object-dtype sequences (matching Series behavior)\n", + " result = pd.MultiIndex.from_frame(\n", + "/datasets/lmitchell/mambaforge/envs/cudf-dev/lib/python3.9/site-packages/cudf/core/column_accessor.py:251: FutureWarning: In a future version, the Index constructor will not infer numeric dtypes when passed object-dtype sequences (matching Series behavior)\n", + " result = pd.MultiIndex.from_frame(\n" + ] + }, { "data": { "text/html": [ @@ -2313,7 +2113,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Accessing values of a DataFrame with a MultiIndex. Note that slicing is not yet supported." + "Accessing values of a DataFrame with a MultiIndex, both via `.loc` and `.iloc`." 
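As a CPU sketch of the two access styles (pandas API, which cuDF mirrors; the index names and values here are hypothetical, not the notebook's `gdf1`):

```python
import pandas as pd

# Hypothetical MultiIndex frame illustrating both access styles; cuDF
# supports the same `.loc`/`.iloc` calls on GPU-backed frames.
idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 3)], names=["outer", "inner"]
)
mdf = pd.DataFrame({"val": [0.1, 0.2, 0.3]}, index=idx)

row = mdf.loc[("b", 3)]   # label-based: select one row by its index tuple
block = mdf.iloc[0:2]     # position-based: select the first two rows
print(row)
print(block)
```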
] }, { @@ -2324,9 +2124,12 @@ { "data": { "text/plain": [ - "first 0.784297\n", - "second 0.793582\n", - "Name: ('b', 3), dtype: float64" + "(first 0.784297\n", + " second 0.793582\n", + " Name: ('b', 3), dtype: float64,\n", + " first second\n", + " a 1 0.082654 0.967955\n", + " 2 0.399417 0.441425)" ] }, "execution_count": 30, @@ -2335,7 +2138,7 @@ } ], "source": [ - "gdf1.loc[('b', 3)]" + "gdf1.loc[('b', 3)], gdf1.iloc[0:2]" ] }, { @@ -2386,11 +2189,9 @@ { "data": { "text/plain": [ - "0 1\n", - "1 2\n", - "2 3\n", - "3 999\n", - "4 4\n", + "0 1\n", + "1 2\n", + "2 3\n", "dtype: int64" ] }, @@ -2400,7 +2201,7 @@ } ], "source": [ - "ds.fillna(999).compute()" + "ds.fillna(999).head(n=3)" ] }, { @@ -2445,6 +2246,15 @@ "s.mean(), s.var()" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This serves as a prototypical example of when we might want to call\n", + "`.compute()`. The result of computing the mean and variance is a single number\n", + "in each case, so it is definitely reasonable to look at the entire result!" 
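A minimal CPU sketch of why reductions are safe to materialize (pandas API, which cuDF mirrors; the values are illustrative):

```python
import pandas as pd

# A reduction collapses a whole column to one number, so the result is
# always small; with dask-cuDF this is exactly where `.compute()` is
# appropriate. Missing values are skipped, as in cuDF.
s = pd.Series([1, 2, 3, None, 4])
mean, var = s.mean(), s.var()
print(mean, var)
```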
+ ] + }, { "cell_type": "code", "execution_count": 34, @@ -2530,26 +2340,11 @@ { "data": { "text/plain": [ - "0 10\n", - "1 11\n", - "2 12\n", - "3 13\n", - "4 14\n", - "5 15\n", - "6 16\n", - "7 17\n", - "8 18\n", - "9 19\n", - "10 20\n", - "11 21\n", - "12 22\n", - "13 23\n", - "14 24\n", - "15 25\n", - "16 26\n", - "17 27\n", - "18 28\n", - "19 29\n", + "0 10\n", + "1 11\n", + "2 12\n", + "3 13\n", + "4 14\n", "Name: a, dtype: int64" ] }, @@ -2559,7 +2354,7 @@ } ], "source": [ - "ddf['a'].map_partitions(add_ten).compute()" + "ddf['a'].map_partitions(add_ten).head()" ] }, { @@ -2629,21 +2424,6 @@ "1 1\n", "14 1\n", "2 1\n", - "5 1\n", - "11 1\n", - "7 1\n", - "17 1\n", - "13 1\n", - "8 1\n", - "16 1\n", - "0 1\n", - "10 1\n", - "4 1\n", - "9 1\n", - "19 1\n", - "18 1\n", - "3 1\n", - "12 1\n", "Name: a, dtype: int64" ] }, @@ -2653,7 +2433,7 @@ } ], "source": [ - "ddf.a.value_counts().compute()" + "ddf.a.value_counts().head()" ] }, { @@ -2667,7 +2447,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Like pandas, cuDF provides string processing methods in the `str` attribute of `Series`. Full documentation of string methods is a work in progress. Please see the cuDF API documentation for more information." + "Like pandas, cuDF provides string processing methods in the `str` attribute of `Series`. Full documentation of string methods is a work in progress. Please see the [cuDF API documentation](https://docs.rapids.ai/api/cudf/stable/api_docs/series.html#string-handling) for more information." 
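A CPU sketch of the `.str` accessor with pandas; a `cudf.Series` exposes the same methods, executed on the GPU (the example strings are illustrative):

```python
import pandas as pd

# String methods live under the `.str` accessor and propagate missing
# values through the operation.
s = pd.Series(["A", "B", "Aaba", None, "CABA"])
lowered = s.str.lower()
print(lowered)
```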
] }, { @@ -2712,11 +2492,6 @@ "1 b\n", "2 c\n", "3 aaba\n", - "4 baca\n", - "5 \n", - "6 caba\n", - "7 dog\n", - "8 cat\n", "dtype: object" ] }, @@ -2727,7 +2502,68 @@ ], "source": [ "ds = dask_cudf.from_cudf(s, npartitions=2)\n", - "ds.str.lower().compute()" + "ds.str.lower().head(n=4)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As well as simple manipulation, we can also match strings using [regular expressions](https://docs.rapids.ai/api/cudf/stable/api_docs/api/cudf.core.column.string.StringMethods.match.html)." + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 False\n", + "1 False\n", + "2 False\n", + "3 True\n", + "4 False\n", + "5 \n", + "6 False\n", + "7 False\n", + "8 True\n", + "dtype: bool" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s.str.match(\"^[aAc].+\")" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 False\n", + "1 False\n", + "2 False\n", + "3 True\n", + "4 False\n", + "dtype: bool" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds.str.match(\"^[aAc].+\").head()" ] }, { @@ -2746,7 +2582,7 @@ }, { "cell_type": "code", - "execution_count": 41, + "execution_count": 43, "metadata": {}, "outputs": [ { @@ -2765,7 +2601,7 @@ "dtype: int64" ] }, - "execution_count": 41, + "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], @@ -2777,33 +2613,26 @@ }, { "cell_type": "code", - "execution_count": 42, + "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "0 1\n", - "1 2\n", - "2 3\n", - "3 \n", - "4 5\n", - "0 1\n", - "1 2\n", - "2 3\n", - "3 \n", - "4 5\n", + "0 1\n", + "1 2\n", + "2 3\n", "dtype: int64" ] }, - "execution_count": 42, + "execution_count": 44, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "ds2 = dask_cudf.from_cudf(s, npartitions=2)\n", - "dask_cudf.concat([ds2, ds2]).compute()" + "dask_cudf.concat([ds2, ds2]).head(n=3)" ] }, { @@ -2817,12 +2646,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Performing SQL style merges. Note that the dataframe order is not maintained, but may be restored post-merge by sorting by the index." + "Performing SQL style merges. Note that the dataframe order is **not maintained**, but may be restored post-merge by sorting by the index." ] }, { "cell_type": "code", - "execution_count": 43, + "execution_count": 45, "metadata": {}, "outputs": [ { @@ -2895,7 +2724,7 @@ "4 d 13.0 " ] }, - "execution_count": 43, + "execution_count": 45, "metadata": {}, "output_type": "execute_result" } @@ -2915,7 +2744,7 @@ }, { "cell_type": "code", - "execution_count": 44, + "execution_count": 46, "metadata": {}, "outputs": [ { @@ -2947,30 +2776,24 @@ " \n", " \n", " 0\n", - " a\n", - " 10.0\n", - " 100.0\n", - " \n", - " \n", - " 1\n", " c\n", " 12.0\n", " 101.0\n", " \n", " \n", + " 1\n", + " e\n", + " 14.0\n", + " 102.0\n", + " \n", + " \n", " 2\n", " b\n", " 11.0\n", " <NA>\n", " \n", " \n", - " 0\n", - " e\n", - " 14.0\n", - " 102.0\n", - " \n", - " \n", - " 1\n", + " 3\n", " d\n", " 13.0\n", " <NA>\n", @@ -2981,14 +2804,13 @@ ], "text/plain": [ " key vals_a vals_b\n", - "0 a 10.0 100.0\n", - "1 c 12.0 101.0\n", + "0 c 12.0 101.0\n", + "1 e 14.0 102.0\n", "2 b 11.0 \n", - "0 e 14.0 102.0\n", - "1 d 13.0 " + "3 d 13.0 " ] }, - "execution_count": 44, + "execution_count": 46, "metadata": {}, "output_type": "execute_result" } @@ -2997,7 +2819,7 @@ "ddf_a = dask_cudf.from_cudf(df_a, npartitions=2)\n", "ddf_b = dask_cudf.from_cudf(df_b, npartitions=2)\n", "\n", - "merged = ddf_a.merge(ddf_b, on=['key'], how='left').compute()\n", + "merged = ddf_a.merge(ddf_b, on=['key'], how='left').head(n=4)\n", "merged" ] }, @@ -3012,12 +2834,12 @@ "cell_type": "markdown", "metadata": {}, 
"source": [ - "Like pandas, cuDF and Dask-cuDF support the Split-Apply-Combine groupby paradigm." + "Like [pandas](https://pandas.pydata.org/docs/user_guide/groupby.html), cuDF and Dask-cuDF support the [Split-Apply-Combine groupby paradigm](https://doi.org/10.18637/jss.v040.i01)." ] }, { "cell_type": "code", - "execution_count": 45, + "execution_count": 47, "metadata": {}, "outputs": [], "source": [ @@ -3036,7 +2858,7 @@ }, { "cell_type": "code", - "execution_count": 46, + "execution_count": 48, "metadata": {}, "outputs": [ { @@ -3099,7 +2921,7 @@ "0 100 90 100 3" ] }, - "execution_count": 46, + "execution_count": 48, "metadata": {}, "output_type": "execute_result" } @@ -3110,7 +2932,7 @@ }, { "cell_type": "code", - "execution_count": 47, + "execution_count": 49, "metadata": {}, "outputs": [ { @@ -3173,7 +2995,7 @@ "0 100 90 100 3" ] }, - "execution_count": 47, + "execution_count": 49, "metadata": {}, "output_type": "execute_result" } @@ -3191,7 +3013,7 @@ }, { "cell_type": "code", - "execution_count": 48, + "execution_count": 50, "metadata": {}, "outputs": [ { @@ -3270,7 +3092,7 @@ "0 1 27 30 27" ] }, - "execution_count": 48, + "execution_count": 50, "metadata": {}, "output_type": "execute_result" } @@ -3281,7 +3103,7 @@ }, { "cell_type": "code", - "execution_count": 49, + "execution_count": 51, "metadata": {}, "outputs": [ { @@ -3360,7 +3182,7 @@ "0 1 27 30 27" ] }, - "execution_count": 49, + "execution_count": 51, "metadata": {}, "output_type": "execute_result" } @@ -3378,7 +3200,7 @@ }, { "cell_type": "code", - "execution_count": 50, + "execution_count": 52, "metadata": {}, "outputs": [ { @@ -3437,7 +3259,7 @@ "0 19 9.0 100" ] }, - "execution_count": 50, + "execution_count": 52, "metadata": {}, "output_type": "execute_result" } @@ -3448,7 +3270,7 @@ }, { "cell_type": "code", - "execution_count": 51, + "execution_count": 53, "metadata": {}, "outputs": [ { @@ -3507,7 +3329,7 @@ "0 19 9.0 100" ] }, - "execution_count": 51, + "execution_count": 53, "metadata": {}, 
"output_type": "execute_result" } @@ -3532,7 +3354,7 @@ }, { "cell_type": "code", - "execution_count": 52, + "execution_count": 54, "metadata": {}, "outputs": [ { @@ -3587,7 +3409,7 @@ "2 3 6" ] }, - "execution_count": 52, + "execution_count": 54, "metadata": {}, "output_type": "execute_result" } @@ -3599,7 +3421,7 @@ }, { "cell_type": "code", - "execution_count": 53, + "execution_count": 55, "metadata": {}, "outputs": [ { @@ -3651,7 +3473,7 @@ "b 4 5 6" ] }, - "execution_count": 53, + "execution_count": 55, "metadata": {}, "output_type": "execute_result" } @@ -3677,7 +3499,7 @@ }, { "cell_type": "code", - "execution_count": 54, + "execution_count": 56, "metadata": {}, "outputs": [ { @@ -3738,7 +3560,7 @@ "3 2018-11-23 0.103839" ] }, - "execution_count": 54, + "execution_count": 56, "metadata": {}, "output_type": "execute_result" } @@ -3756,7 +3578,7 @@ }, { "cell_type": "code", - "execution_count": 55, + "execution_count": 57, "metadata": {}, "outputs": [ { @@ -3817,7 +3639,7 @@ "3 2018-11-23 0.103839" ] }, - "execution_count": 55, + "execution_count": 57, "metadata": {}, "output_type": "execute_result" } @@ -3844,7 +3666,7 @@ }, { "cell_type": "code", - "execution_count": 56, + "execution_count": 58, "metadata": {}, "outputs": [ { @@ -3917,7 +3739,7 @@ "5 6 e" ] }, - "execution_count": 56, + "execution_count": 58, "metadata": {}, "output_type": "execute_result" } @@ -3930,7 +3752,7 @@ }, { "cell_type": "code", - "execution_count": 57, + "execution_count": 59, "metadata": {}, "outputs": [ { @@ -3974,21 +3796,6 @@ " 3\n", " b\n", " \n", - " \n", - " 3\n", - " 4\n", - " a\n", - " \n", - " \n", - " 4\n", - " 5\n", - " a\n", - " \n", - " \n", - " 5\n", - " 6\n", - " e\n", - " \n", " \n", "\n", "" @@ -3997,20 +3804,17 @@ " id grade\n", "0 1 a\n", "1 2 b\n", - "2 3 b\n", - "3 4 a\n", - "4 5 a\n", - "5 6 e" + "2 3 b" ] }, - "execution_count": 57, + "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dgdf = dask_cudf.from_cudf(gdf, 
npartitions=2)\n", - "dgdf.compute()" + "dgdf.head(n=3)" ] }, { @@ -4022,7 +3826,7 @@ }, { "cell_type": "code", - "execution_count": 58, + "execution_count": 60, "metadata": {}, "outputs": [ { @@ -4031,7 +3835,7 @@ "StringIndex(['a' 'b' 'e'], dtype='object')" ] }, - "execution_count": 58, + "execution_count": 60, "metadata": {}, "output_type": "execute_result" } @@ -4049,7 +3853,7 @@ }, { "cell_type": "code", - "execution_count": 59, + "execution_count": 61, "metadata": {}, "outputs": [ { @@ -4064,7 +3868,7 @@ "dtype: uint8" ] }, - "execution_count": 59, + "execution_count": 61, "metadata": {}, "output_type": "execute_result" } @@ -4075,7 +3879,7 @@ }, { "cell_type": "code", - "execution_count": 60, + "execution_count": 62, "metadata": {}, "outputs": [ { @@ -4090,7 +3894,7 @@ "dtype: uint8" ] }, - "execution_count": 60, + "execution_count": 62, "metadata": {}, "output_type": "execute_result" } @@ -4123,7 +3927,7 @@ }, { "cell_type": "code", - "execution_count": 61, + "execution_count": 63, "metadata": {}, "outputs": [ { @@ -4208,7 +4012,7 @@ "4 4 15 4 1 0" ] }, - "execution_count": 61, + "execution_count": 63, "metadata": {}, "output_type": "execute_result" } @@ -4217,9 +4021,17 @@ "df.head().to_pandas()" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To convert the first few entries to pandas, we similarly call `.head()` on the\n", + "dask-cuDF dataframe to obtain a local cuDF dataframe, which we can then convert." 
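A CPU sketch of why this is safe (pandas stand-in; with dask-cuDF the equivalent chain would be `ddf.head().to_pandas()`):

```python
import pandas as pd

# `.head(n)` always yields a small, fully materialized frame regardless of
# how large the source is, so converting the preview afterwards is cheap.
df = pd.DataFrame({"a": range(1000), "b": range(1000)})
preview = df.head(5)
print(preview)
```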
+ ] + }, { "cell_type": "code", - "execution_count": 62, + "execution_count": 64, "metadata": {}, "outputs": [ { @@ -4304,13 +4116,121 @@ "4 4 15 4 1 0" ] }, - "execution_count": 62, + "execution_count": 64, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ddf.head().to_pandas()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In contrast, if we want to convert the entire frame, we need to call\n", + "`.compute()` on `ddf` to get a local cuDF dataframe, and then call\n", + "`to_pandas()`, followed by any further processing. This workflow is not\n", + "recommended, since it both puts high memory pressure on a single GPU (the\n", + "`.compute()` call) and does not take advantage of GPU acceleration for\n", + "processing (the computation happens in pandas)." + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
abcagg_col1agg_col2
0019011
1118100
2217210
3316301
4415410
\n", + "
" + ], + "text/plain": [ + " a b c agg_col1 agg_col2\n", + "0 0 19 0 1 1\n", + "1 1 18 1 0 0\n", + "2 2 17 2 1 0\n", + "3 3 16 3 0 1\n", + "4 4 15 4 1 0" + ] + }, + "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "ddf.compute().head().to_pandas()" + "ddf.compute().to_pandas().head()" ] }, { @@ -4329,7 +4249,7 @@ }, { "cell_type": "code", - "execution_count": 63, + "execution_count": 66, "metadata": {}, "outputs": [ { @@ -4357,7 +4277,7 @@ " [19, 0, 19, 0, 0]])" ] }, - "execution_count": 63, + "execution_count": 66, "metadata": {}, "output_type": "execute_result" } @@ -4368,7 +4288,7 @@ }, { "cell_type": "code", - "execution_count": 64, + "execution_count": 67, "metadata": {}, "outputs": [ { @@ -4396,7 +4316,7 @@ " [19, 0, 19, 0, 0]])" ] }, - "execution_count": 64, + "execution_count": 67, "metadata": {}, "output_type": "execute_result" } @@ -4414,7 +4334,7 @@ }, { "cell_type": "code", - "execution_count": 65, + "execution_count": 68, "metadata": {}, "outputs": [ { @@ -4424,7 +4344,7 @@ " 17, 18, 19])" ] }, - "execution_count": 65, + "execution_count": 68, "metadata": {}, "output_type": "execute_result" } @@ -4435,7 +4355,7 @@ }, { "cell_type": "code", - "execution_count": 66, + "execution_count": 69, "metadata": {}, "outputs": [ { @@ -4445,7 +4365,7 @@ " 17, 18, 19])" ] }, - "execution_count": 66, + "execution_count": 69, "metadata": {}, "output_type": "execute_result" } @@ -4470,7 +4390,7 @@ }, { "cell_type": "code", - "execution_count": 67, + "execution_count": 70, "metadata": {}, "outputs": [ { @@ -4483,14 +4403,14 @@ "agg_col1: int64\n", "agg_col2: int64\n", "----\n", - "a: [[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]]\n", - "b: [[19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0]]\n", - "c: [[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]]\n", - "agg_col1: [[1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0]]\n", - "agg_col2: [[1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0]]" + "a: [[0,1,2,3,4,...,15,16,17,18,19]]\n", + 
"b: [[19,18,17,16,15,...,4,3,2,1,0]]\n", + "c: [[0,1,2,3,4,...,15,16,17,18,19]]\n", + "agg_col1: [[1,0,1,0,1,...,0,1,0,1,0]]\n", + "agg_col2: [[1,0,0,1,0,...,1,0,0,1,0]]" ] }, - "execution_count": 67, + "execution_count": 70, "metadata": {}, "output_type": "execute_result" } @@ -4501,7 +4421,7 @@ }, { "cell_type": "code", - "execution_count": 68, + "execution_count": 71, "metadata": {}, "outputs": [ { @@ -4514,20 +4434,20 @@ "agg_col1: int64\n", "agg_col2: int64\n", "----\n", - "a: [[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]]\n", - "b: [[19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0]]\n", - "c: [[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]]\n", - "agg_col1: [[1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0]]\n", - "agg_col2: [[1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0]]" + "a: [[0,1,2,3,4]]\n", + "b: [[19,18,17,16,15]]\n", + "c: [[0,1,2,3,4]]\n", + "agg_col1: [[1,0,1,0,1]]\n", + "agg_col2: [[1,0,0,1,0]]" ] }, - "execution_count": 68, + "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "ddf.compute().to_arrow()" + "ddf.head().to_arrow()" ] }, { @@ -4554,7 +4474,7 @@ }, { "cell_type": "code", - "execution_count": 69, + "execution_count": 72, "metadata": {}, "outputs": [], "source": [ @@ -4566,7 +4486,7 @@ }, { "cell_type": "code", - "execution_count": 70, + "execution_count": 73, "metadata": {}, "outputs": [], "source": [ @@ -4582,7 +4502,7 @@ }, { "cell_type": "code", - "execution_count": 71, + "execution_count": 74, "metadata": {}, "outputs": [ { @@ -4802,7 +4722,7 @@ "19 19 0 19 0 0" ] }, - "execution_count": 71, + "execution_count": 74, "metadata": {}, "output_type": "execute_result" } @@ -4812,9 +4732,19 @@ "df" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that for the dask-cuDF case, we use `dask_cudf.read_csv` in preference to\n", + "`dask_cudf.from_cudf(cudf.read_csv)` since the former can parallelize across\n", + "multiple GPUs and handle larger CSV files that would fit in 
memory on a single\n", + "GPU." + ] + }, { "cell_type": "code", - "execution_count": 72, + "execution_count": 75, "metadata": {}, "outputs": [ { @@ -4886,162 +4816,27 @@ " 1\n", " 0\n", " \n", - " \n", - " 5\n", - " 5\n", - " 14\n", - " 5\n", - " 0\n", - " 0\n", - " \n", - " \n", - " 6\n", - " 6\n", - " 13\n", - " 6\n", - " 1\n", - " 1\n", - " \n", - " \n", - " 7\n", - " 7\n", - " 12\n", - " 7\n", - " 0\n", - " 0\n", - " \n", - " \n", - " 8\n", - " 8\n", - " 11\n", - " 8\n", - " 1\n", - " 0\n", - " \n", - " \n", - " 9\n", - " 9\n", - " 10\n", - " 9\n", - " 0\n", - " 1\n", - " \n", - " \n", - " 10\n", - " 10\n", - " 9\n", - " 10\n", - " 1\n", - " 0\n", - " \n", - " \n", - " 11\n", - " 11\n", - " 8\n", - " 11\n", - " 0\n", - " 0\n", - " \n", - " \n", - " 12\n", - " 12\n", - " 7\n", - " 12\n", - " 1\n", - " 1\n", - " \n", - " \n", - " 13\n", - " 13\n", - " 6\n", - " 13\n", - " 0\n", - " 0\n", - " \n", - " \n", - " 14\n", - " 14\n", - " 5\n", - " 14\n", - " 1\n", - " 0\n", - " \n", - " \n", - " 15\n", - " 15\n", - " 4\n", - " 15\n", - " 0\n", - " 1\n", - " \n", - " \n", - " 16\n", - " 16\n", - " 3\n", - " 16\n", - " 1\n", - " 0\n", - " \n", - " \n", - " 17\n", - " 17\n", - " 2\n", - " 17\n", - " 0\n", - " 0\n", - " \n", - " \n", - " 18\n", - " 18\n", - " 1\n", - " 18\n", - " 1\n", - " 1\n", - " \n", - " \n", - " 19\n", - " 19\n", - " 0\n", - " 19\n", - " 0\n", - " 0\n", - " \n", " \n", "\n", "" ], "text/plain": [ - " a b c agg_col1 agg_col2\n", - "0 0 19 0 1 1\n", - "1 1 18 1 0 0\n", - "2 2 17 2 1 0\n", - "3 3 16 3 0 1\n", - "4 4 15 4 1 0\n", - "5 5 14 5 0 0\n", - "6 6 13 6 1 1\n", - "7 7 12 7 0 0\n", - "8 8 11 8 1 0\n", - "9 9 10 9 0 1\n", - "10 10 9 10 1 0\n", - "11 11 8 11 0 0\n", - "12 12 7 12 1 1\n", - "13 13 6 13 0 0\n", - "14 14 5 14 1 0\n", - "15 15 4 15 0 1\n", - "16 16 3 16 1 0\n", - "17 17 2 17 0 0\n", - "18 18 1 18 1 1\n", - "19 19 0 19 0 0" + " a b c agg_col1 agg_col2\n", + "0 0 19 0 1 1\n", + "1 1 18 1 0 0\n", + "2 2 17 2 1 0\n", + "3 3 16 3 0 1\n", + "4 
4 15 4 1 0" ] }, - "execution_count": 72, + "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf = dask_cudf.read_csv('example_output/foo_dask.csv')\n", - "ddf.compute()" + "ddf.head()" ] }, { @@ -5053,7 +4848,7 @@ }, { "cell_type": "code", - "execution_count": 73, + "execution_count": 76, "metadata": {}, "outputs": [ { @@ -5077,174 +4872,14 @@ " \n", " \n", " \n", - " a\n", - " b\n", - " c\n", - " agg_col1\n", - " agg_col2\n", - " \n", - " \n", - " \n", - " \n", - " 0\n", - " 0\n", - " 19\n", - " 0\n", - " 1\n", - " 1\n", - " \n", - " \n", - " 1\n", - " 1\n", - " 18\n", - " 1\n", - " 0\n", - " 0\n", - " \n", - " \n", - " 2\n", - " 2\n", - " 17\n", - " 2\n", - " 1\n", - " 0\n", - " \n", - " \n", - " 3\n", - " 3\n", - " 16\n", - " 3\n", - " 0\n", - " 1\n", - " \n", - " \n", - " 4\n", - " 4\n", - " 15\n", - " 4\n", - " 1\n", - " 0\n", - " \n", - " \n", - " 5\n", - " 5\n", - " 14\n", - " 5\n", - " 0\n", - " 0\n", - " \n", - " \n", - " 6\n", - " 6\n", - " 13\n", - " 6\n", - " 1\n", - " 1\n", - " \n", - " \n", - " 7\n", - " 7\n", - " 12\n", - " 7\n", - " 0\n", - " 0\n", - " \n", - " \n", - " 8\n", - " 8\n", - " 11\n", - " 8\n", - " 1\n", - " 0\n", - " \n", - " \n", - " 9\n", - " 9\n", - " 10\n", - " 9\n", - " 0\n", - " 1\n", - " \n", - " \n", - " 10\n", - " 10\n", - " 9\n", - " 10\n", - " 1\n", - " 0\n", - " \n", - " \n", - " 11\n", - " 11\n", - " 8\n", - " 11\n", - " 0\n", - " 0\n", - " \n", - " \n", - " 12\n", - " 12\n", - " 7\n", - " 12\n", - " 1\n", - " 1\n", - " \n", - " \n", - " 13\n", - " 13\n", - " 6\n", - " 13\n", - " 0\n", - " 0\n", - " \n", - " \n", - " 14\n", - " 14\n", - " 5\n", - " 14\n", - " 1\n", - " 0\n", - " \n", - " \n", - " 15\n", - " 15\n", - " 4\n", - " 15\n", - " 0\n", - " 1\n", - " \n", - " \n", - " 16\n", - " 16\n", - " 3\n", - " 16\n", - " 1\n", - " 0\n", - " \n", - " \n", - " 17\n", - " 17\n", - " 2\n", - " 17\n", - " 0\n", - " 0\n", - " \n", - " \n", - " 18\n", - " 18\n", - " 1\n", - " 18\n", - " 1\n", 
- " 1\n", - " \n", - " \n", - " 19\n", - " 19\n", - " 0\n", - " 19\n", - " 0\n", - " 0\n", + " a\n", + " b\n", + " c\n", + " agg_col1\n", + " agg_col2\n", " \n", + " \n", + " \n", " \n", " 0\n", " 0\n", @@ -5285,182 +4920,27 @@ " 1\n", " 0\n", " \n", - " \n", - " 5\n", - " 5\n", - " 14\n", - " 5\n", - " 0\n", - " 0\n", - " \n", - " \n", - " 6\n", - " 6\n", - " 13\n", - " 6\n", - " 1\n", - " 1\n", - " \n", - " \n", - " 7\n", - " 7\n", - " 12\n", - " 7\n", - " 0\n", - " 0\n", - " \n", - " \n", - " 8\n", - " 8\n", - " 11\n", - " 8\n", - " 1\n", - " 0\n", - " \n", - " \n", - " 9\n", - " 9\n", - " 10\n", - " 9\n", - " 0\n", - " 1\n", - " \n", - " \n", - " 10\n", - " 10\n", - " 9\n", - " 10\n", - " 1\n", - " 0\n", - " \n", - " \n", - " 11\n", - " 11\n", - " 8\n", - " 11\n", - " 0\n", - " 0\n", - " \n", - " \n", - " 12\n", - " 12\n", - " 7\n", - " 12\n", - " 1\n", - " 1\n", - " \n", - " \n", - " 13\n", - " 13\n", - " 6\n", - " 13\n", - " 0\n", - " 0\n", - " \n", - " \n", - " 14\n", - " 14\n", - " 5\n", - " 14\n", - " 1\n", - " 0\n", - " \n", - " \n", - " 15\n", - " 15\n", - " 4\n", - " 15\n", - " 0\n", - " 1\n", - " \n", - " \n", - " 16\n", - " 16\n", - " 3\n", - " 16\n", - " 1\n", - " 0\n", - " \n", - " \n", - " 17\n", - " 17\n", - " 2\n", - " 17\n", - " 0\n", - " 0\n", - " \n", - " \n", - " 18\n", - " 18\n", - " 1\n", - " 18\n", - " 1\n", - " 1\n", - " \n", - " \n", - " 19\n", - " 19\n", - " 0\n", - " 19\n", - " 0\n", - " 0\n", - " \n", " \n", "\n", "" ], "text/plain": [ - " a b c agg_col1 agg_col2\n", - "0 0 19 0 1 1\n", - "1 1 18 1 0 0\n", - "2 2 17 2 1 0\n", - "3 3 16 3 0 1\n", - "4 4 15 4 1 0\n", - "5 5 14 5 0 0\n", - "6 6 13 6 1 1\n", - "7 7 12 7 0 0\n", - "8 8 11 8 1 0\n", - "9 9 10 9 0 1\n", - "10 10 9 10 1 0\n", - "11 11 8 11 0 0\n", - "12 12 7 12 1 1\n", - "13 13 6 13 0 0\n", - "14 14 5 14 1 0\n", - "15 15 4 15 0 1\n", - "16 16 3 16 1 0\n", - "17 17 2 17 0 0\n", - "18 18 1 18 1 1\n", - "19 19 0 19 0 0\n", - "0 0 19 0 1 1\n", - "1 1 18 1 0 0\n", - "2 2 17 2 1 
0\n", - "3 3 16 3 0 1\n", - "4 4 15 4 1 0\n", - "5 5 14 5 0 0\n", - "6 6 13 6 1 1\n", - "7 7 12 7 0 0\n", - "8 8 11 8 1 0\n", - "9 9 10 9 0 1\n", - "10 10 9 10 1 0\n", - "11 11 8 11 0 0\n", - "12 12 7 12 1 1\n", - "13 13 6 13 0 0\n", - "14 14 5 14 1 0\n", - "15 15 4 15 0 1\n", - "16 16 3 16 1 0\n", - "17 17 2 17 0 0\n", - "18 18 1 18 1 1\n", - "19 19 0 19 0 0" + " a b c agg_col1 agg_col2\n", + "0 0 19 0 1 1\n", + "1 1 18 1 0 0\n", + "2 2 17 2 1 0\n", + "3 3 16 3 0 1\n", + "4 4 15 4 1 0" ] }, - "execution_count": 73, + "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf = dask_cudf.read_csv('example_output/*.csv')\n", - "ddf.compute()" + "ddf.head()" ] }, { @@ -5474,12 +4954,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Writing to parquet files with GPU-accelerated parquet writer" + "Writing to parquet files with cuDF's GPU-accelerated parquet writer" ] }, { "cell_type": "code", - "execution_count": 74, + "execution_count": 77, "metadata": {}, "outputs": [], "source": [ @@ -5490,12 +4970,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Reading parquet files with a GPU-accelerated parquet reader." + "Reading parquet files with cuDF's GPU-accelerated parquet reader." ] }, { "cell_type": "code", - "execution_count": 75, + "execution_count": 78, "metadata": {}, "outputs": [ { @@ -5715,7 +5195,7 @@ "19 19 0 19 0 0" ] }, - "execution_count": 75, + "execution_count": 78, "metadata": {}, "output_type": "execute_result" } @@ -5729,25 +5209,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Writing to parquet files from a `dask_cudf.DataFrame` using PyArrow under the hood." + "Writing to parquet files from a `dask_cudf.DataFrame` using cuDF's\n", + "GPU-accelerated parquet writer under the hood."
] }, { "cell_type": "code", - "execution_count": 76, + "execution_count": 79, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(None,)" - ] - }, - "execution_count": 76, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "ddf.to_parquet('example_output/ddf_parquet_files')" ] @@ -5768,7 +5238,7 @@ }, { "cell_type": "code", - "execution_count": 77, + "execution_count": 80, "metadata": {}, "outputs": [], "source": [ @@ -5780,7 +5250,7 @@ }, { "cell_type": "code", - "execution_count": 78, + "execution_count": 81, "metadata": {}, "outputs": [ { @@ -5871,7 +5341,7 @@ "1 [{'key': 'chani', 'value': {'int1': 5, 'string... " ] }, - "execution_count": 78, + "execution_count": 81, "metadata": {}, "output_type": "execute_result" } @@ -5888,7 +5358,7 @@ "Dask Performance Tips\n", "--------------------------------\n", "\n", - "Like Apache Spark, Dask operations are [lazy](https://en.wikipedia.org/wiki/Lazy_evaluation). Instead of being executed at that moment, most operations are added to a task graph and the actual evaluation is delayed until the result is needed.\n", + "Like Apache Spark, Dask operations are [lazy](https://en.wikipedia.org/wiki/Lazy_evaluation). Instead of being executed immediately, most operations are added to a task graph and the actual evaluation is delayed until the result is needed.\n", "\n", "Sometimes, though, we want to force the execution of operations. Calling `persist` on a Dask collection fully computes it (or actively computes it in the background), persisting the result into memory. When we're using distributed systems, we may want to wait until `persist` is finished before beginning any downstream operations. We can enforce this contract by using `wait`. 
Wrapping an operation with `wait` will ensure it doesn't begin executing until all necessary upstream operations have finished.\n", "\n", @@ -5904,14 +5374,45 @@ }, { "cell_type": "code", - "execution_count": 79, + "execution_count": 82, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "2022-05-12 22:41:08,024 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n" + "2022-11-10 01:44:46,045 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-tu6tt139', purging\n", + "2022-11-10 01:44:46,046 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-x67qjxhk', purging\n", + "2022-11-10 01:44:46,046 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-obc257eb', purging\n", + "2022-11-10 01:44:46,046 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-za968vta', purging\n", + "2022-11-10 01:44:46,046 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-p_qsz6vc', purging\n", + "2022-11-10 01:44:46,047 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-4qu6llmc', purging\n", + "2022-11-10 01:44:46,047 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-p8gceq7r', purging\n", + "2022-11-10 01:44:46,047 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-0t0dve42', purging\n", + "2022-11-10 01:44:46,047 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-nz2l7_zg', purging\n", + "2022-11-10 01:44:46,047 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-4pb4d9w9', purging\n", + "2022-11-10 01:44:46,048 - distributed.diskutils - INFO - Found 
stale lock file and directory '/tmp/dask-worker-space/worker-sshvbh55', purging\n", + "2022-11-10 01:44:46,048 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-z_roztcz', purging\n", + "2022-11-10 01:44:46,048 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-d_popj3t', purging\n", + "2022-11-10 01:44:46,048 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-q0wbi3uy', purging\n", + "2022-11-10 01:44:46,048 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-hyjznxgz', purging\n", + "2022-11-10 01:44:46,049 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-worker-space/worker-hrwh26we', purging\n", + "2022-11-10 01:44:46,049 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize\n", + "2022-11-10 01:44:46,049 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n", + "2022-11-10 01:44:46,049 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize\n", + "2022-11-10 01:44:46,049 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n", + "2022-11-10 01:44:46,064 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize\n", + "2022-11-10 01:44:46,064 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n", + "2022-11-10 01:44:46,071 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize\n", + "2022-11-10 01:44:46,071 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n", + "2022-11-10 01:44:46,143 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize\n", + "2022-11-10 01:44:46,143 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n", + "2022-11-10 01:44:46,202 - distributed.preloading - INFO - Creating preload: 
dask_cuda.initialize\n", + "2022-11-10 01:44:46,202 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n", + "2022-11-10 01:44:46,225 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize\n", + "2022-11-10 01:44:46,225 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n", + "2022-11-10 01:44:46,257 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize\n", + "2022-11-10 01:44:46,258 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize\n" ] } ], @@ -5935,7 +5436,7 @@ }, { "cell_type": "code", - "execution_count": 80, + "execution_count": 83, "metadata": {}, "outputs": [ { @@ -6005,13 +5506,13 @@ " \n", "\n", "\n", - "
Dask Name: assign, 20 tasks
" + "
Dask Name: assign, 4 graph layers
" ], "text/plain": [ "" ] }, - "execution_count": 80, + "execution_count": 83, "metadata": {}, "output_type": "execute_result" } @@ -6027,23 +5528,51 @@ }, { "cell_type": "code", - "execution_count": 81, + "execution_count": 84, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Thu May 12 22:41:08 2022 \n", + "Thu Nov 10 01:44:49 2022 \n", "+-----------------------------------------------------------------------------+\n", - "| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |\n", + "| NVIDIA-SMI 510.73.08 Driver Version: 510.73.08 CUDA Version: 11.6 |\n", "|-------------------------------+----------------------+----------------------+\n", "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", "| | | MIG M. |\n", "|===============================+======================+======================|\n", - "| 0 NVIDIA RTX A6000 On | 00000000:65:00.0 On | Off |\n", - "| 30% 41C P2 77W / 300W | 1380MiB / 49140MiB | 2% Default |\n", + "| 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 |\n", + "| N/A 32C P0 55W / 300W | 5197MiB / 32768MiB | 0% Default |\n", + "| | | N/A |\n", + "+-------------------------------+----------------------+----------------------+\n", + "| 1 Tesla V100-SXM2... On | 00000000:07:00.0 Off | 0 |\n", + "| N/A 32C P0 55W / 300W | 336MiB / 32768MiB | 0% Default |\n", + "| | | N/A |\n", + "+-------------------------------+----------------------+----------------------+\n", + "| 2 Tesla V100-SXM2... On | 00000000:0A:00.0 Off | 0 |\n", + "| N/A 33C P0 55W / 300W | 336MiB / 32768MiB | 0% Default |\n", + "| | | N/A |\n", + "+-------------------------------+----------------------+----------------------+\n", + "| 3 Tesla V100-SXM2... 
On | 00000000:0B:00.0 Off | 0 |\n", + "| N/A 31C P0 55W / 300W | 336MiB / 32768MiB | 0% Default |\n", + "| | | N/A |\n", + "+-------------------------------+----------------------+----------------------+\n", + "| 4 Tesla V100-SXM2... On | 00000000:85:00.0 Off | 0 |\n", + "| N/A 32C P0 54W / 300W | 336MiB / 32768MiB | 0% Default |\n", + "| | | N/A |\n", + "+-------------------------------+----------------------+----------------------+\n", + "| 5 Tesla V100-SXM2... On | 00000000:86:00.0 Off | 0 |\n", + "| N/A 33C P0 56W / 300W | 336MiB / 32768MiB | 0% Default |\n", + "| | | N/A |\n", + "+-------------------------------+----------------------+----------------------+\n", + "| 6 Tesla V100-SXM2... On | 00000000:89:00.0 Off | 0 |\n", + "| N/A 35C P0 55W / 300W | 336MiB / 32768MiB | 0% Default |\n", + "| | | N/A |\n", + "+-------------------------------+----------------------+----------------------+\n", + "| 7 Tesla V100-SXM2... On | 00000000:8A:00.0 Off | 0 |\n", + "| N/A 32C P0 54W / 300W | 336MiB / 32768MiB | 0% Default |\n", "| | | N/A |\n", "+-------------------------------+----------------------+----------------------+\n", " \n", @@ -6052,12 +5581,19 @@ "| GPU GI CI PID Type Process name GPU Memory |\n", "| ID ID Usage |\n", "|=============================================================================|\n", - "| 0 N/A N/A 1674 G 159MiB |\n", - "| 0 N/A N/A 1950 G 47MiB |\n", - "| 0 N/A N/A 13521 G 132MiB |\n", - "| 0 N/A N/A 304797 G 36MiB |\n", - "| 0 N/A N/A 488366 C 743MiB |\n", - "| 0 N/A N/A 488425 C 257MiB |\n", + "| 0 N/A N/A 564 C .../envs/cudf-dev/bin/python 861MiB |\n", + "| 0 N/A N/A 1891 C .../envs/cudf-dev/bin/python 333MiB |\n", + "| 0 N/A N/A 12797 C ...a3/envs/sql/bin/python3.9 1649MiB |\n", + "| 0 N/A N/A 64399 C ...a3/envs/sql/bin/python3.9 949MiB |\n", + "| 0 N/A N/A 68311 C ...vs/cudf_dev/bin/python3.9 741MiB |\n", + "| 0 N/A N/A 80212 C ...vs/cudf-dev/bin/python3.9 659MiB |\n", + "| 1 N/A N/A 1878 C .../envs/cudf-dev/bin/python 333MiB |\n", + 
"| 2 N/A N/A 1871 C .../envs/cudf-dev/bin/python 333MiB |\n", + "| 3 N/A N/A 1884 C .../envs/cudf-dev/bin/python 333MiB |\n", + "| 4 N/A N/A 1880 C .../envs/cudf-dev/bin/python 333MiB |\n", + "| 5 N/A N/A 1873 C .../envs/cudf-dev/bin/python 333MiB |\n", + "| 6 N/A N/A 1889 C .../envs/cudf-dev/bin/python 333MiB |\n", + "| 7 N/A N/A 1883 C .../envs/cudf-dev/bin/python 333MiB |\n", "+-----------------------------------------------------------------------------+\n" ] } @@ -6075,7 +5611,7 @@ }, { "cell_type": "code", - "execution_count": 82, + "execution_count": 85, "metadata": {}, "outputs": [ { @@ -6145,13 +5681,13 @@ " \n", "\n", "\n", - "
Dask Name: assign, 5 tasks
" + "
Dask Name: assign, 1 graph layer
" ], "text/plain": [ "" ] }, - "execution_count": 82, + "execution_count": 85, "metadata": {}, "output_type": "execute_result" } @@ -6163,38 +5699,73 @@ }, { "cell_type": "code", - "execution_count": 83, + "execution_count": 86, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Thu May 12 22:41:14 2022 \r\n", - "+-----------------------------------------------------------------------------+\r\n", - "| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |\r\n", - "|-------------------------------+----------------------+----------------------+\r\n", - "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\r\n", - "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\r\n", - "| | | MIG M. |\r\n", - "|===============================+======================+======================|\r\n", - "| 0 NVIDIA RTX A6000 On | 00000000:65:00.0 On | Off |\r\n", - "| 30% 42C P2 77W / 300W | 1942MiB / 49140MiB | 0% Default |\r\n", - "| | | N/A |\r\n", - "+-------------------------------+----------------------+----------------------+\r\n", - " \r\n", - "+-----------------------------------------------------------------------------+\r\n", - "| Processes: |\r\n", - "| GPU GI CI PID Type Process name GPU Memory |\r\n", - "| ID ID Usage |\r\n", - "|=============================================================================|\r\n", - "| 0 N/A N/A 1674 G 159MiB |\r\n", - "| 0 N/A N/A 1950 G 47MiB |\r\n", - "| 0 N/A N/A 13521 G 132MiB |\r\n", - "| 0 N/A N/A 304797 G 36MiB |\r\n", - "| 0 N/A N/A 488366 C 743MiB |\r\n", - "| 0 N/A N/A 488425 C 819MiB |\r\n", - "+-----------------------------------------------------------------------------+\r\n" + "Thu Nov 10 01:44:55 2022 \n", + "+-----------------------------------------------------------------------------+\n", + "| NVIDIA-SMI 510.73.08 Driver Version: 510.73.08 CUDA Version: 11.6 |\n", + 
"|-------------------------------+----------------------+----------------------+\n", + "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\n", + "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\n", + "| | | MIG M. |\n", + "|===============================+======================+======================|\n", + "| 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 |\n", + "| N/A 32C P0 55W / 300W | 5571MiB / 32768MiB | 0% Default |\n", + "| | | N/A |\n", + "+-------------------------------+----------------------+----------------------+\n", + "| 1 Tesla V100-SXM2... On | 00000000:07:00.0 Off | 0 |\n", + "| N/A 32C P0 56W / 300W | 336MiB / 32768MiB | 0% Default |\n", + "| | | N/A |\n", + "+-------------------------------+----------------------+----------------------+\n", + "| 2 Tesla V100-SXM2... On | 00000000:0A:00.0 Off | 0 |\n", + "| N/A 33C P0 55W / 300W | 710MiB / 32768MiB | 0% Default |\n", + "| | | N/A |\n", + "+-------------------------------+----------------------+----------------------+\n", + "| 3 Tesla V100-SXM2... On | 00000000:0B:00.0 Off | 0 |\n", + "| N/A 31C P0 55W / 300W | 336MiB / 32768MiB | 0% Default |\n", + "| | | N/A |\n", + "+-------------------------------+----------------------+----------------------+\n", + "| 4 Tesla V100-SXM2... On | 00000000:85:00.0 Off | 0 |\n", + "| N/A 32C P0 54W / 300W | 710MiB / 32768MiB | 0% Default |\n", + "| | | N/A |\n", + "+-------------------------------+----------------------+----------------------+\n", + "| 5 Tesla V100-SXM2... On | 00000000:86:00.0 Off | 0 |\n", + "| N/A 33C P0 57W / 300W | 710MiB / 32768MiB | 0% Default |\n", + "| | | N/A |\n", + "+-------------------------------+----------------------+----------------------+\n", + "| 6 Tesla V100-SXM2... 
On | 00000000:89:00.0 Off | 0 |\n", + "| N/A 35C P0 55W / 300W | 710MiB / 32768MiB | 0% Default |\n", + "| | | N/A |\n", + "+-------------------------------+----------------------+----------------------+\n", + "| 7 Tesla V100-SXM2... On | 00000000:8A:00.0 Off | 0 |\n", + "| N/A 32C P0 54W / 300W | 336MiB / 32768MiB | 0% Default |\n", + "| | | N/A |\n", + "+-------------------------------+----------------------+----------------------+\n", + " \n", + "+-----------------------------------------------------------------------------+\n", + "| Processes: |\n", + "| GPU GI CI PID Type Process name GPU Memory |\n", + "| ID ID Usage |\n", + "|=============================================================================|\n", + "| 0 N/A N/A 564 C .../envs/cudf-dev/bin/python 861MiB |\n", + "| 0 N/A N/A 1891 C .../envs/cudf-dev/bin/python 707MiB |\n", + "| 0 N/A N/A 12797 C ...a3/envs/sql/bin/python3.9 1649MiB |\n", + "| 0 N/A N/A 64399 C ...a3/envs/sql/bin/python3.9 949MiB |\n", + "| 0 N/A N/A 68311 C ...vs/cudf_dev/bin/python3.9 741MiB |\n", + "| 0 N/A N/A 80212 C ...vs/cudf-dev/bin/python3.9 659MiB |\n", + "| 1 N/A N/A 1878 C .../envs/cudf-dev/bin/python 333MiB |\n", + "| 2 N/A N/A 1871 C .../envs/cudf-dev/bin/python 707MiB |\n", + "| 3 N/A N/A 1884 C .../envs/cudf-dev/bin/python 333MiB |\n", + "| 4 N/A N/A 1880 C .../envs/cudf-dev/bin/python 707MiB |\n", + "| 5 N/A N/A 1873 C .../envs/cudf-dev/bin/python 707MiB |\n", + "| 6 N/A N/A 1889 C .../envs/cudf-dev/bin/python 707MiB |\n", + "| 7 N/A N/A 1883 C .../envs/cudf-dev/bin/python 333MiB |\n", + "+-----------------------------------------------------------------------------+\n" ] } ], @@ -6222,7 +5793,7 @@ }, { "cell_type": "code", - "execution_count": 84, + "execution_count": 87, "metadata": {}, "outputs": [], "source": [ @@ -6247,7 +5818,7 @@ }, { "cell_type": "code", - "execution_count": 85, + "execution_count": 88, "metadata": {}, "outputs": [], "source": [ @@ -6264,16 +5835,16 @@ }, { "cell_type": "code", - 
"execution_count": 86, + "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "DoneAndNotDoneFutures(done={, , , , }, not_done=set())" + "DoneAndNotDoneFutures(done={, , , , }, not_done=set())" ] }, - "execution_count": 86, + "execution_count": 89, "metadata": {}, "output_type": "execute_result" } @@ -6286,14 +5857,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "With `wait`, we can safely proceed on in our workflow." + "With `wait` completed, we can safely proceed with our workflow." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "Python 3.9.13 ('cudf-dev')", "language": "python", "name": "python3" }, @@ -6307,7 +5878,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.12" + "version": "3.9.13" }, "toc": { "base_numbering": 1, @@ -6321,6 +5892,11 @@ "toc_position": {}, "toc_section_display": true, "toc_window_display": false + }, + "vscode": { + "interpreter": { + "hash": "8056d08c5310318d9ca4fe60778daf853f02695d9fa19fd0f51ce5f8b089487a" + } } }, "nbformat": 4,