forked from modin-project/modin
DOCS-modin-project#2334: Add tutorials to main repo
Signed-off-by: Devin Petersohn <[email protected]>
1 parent 9d3ead8, commit 2707dd3
Showing 16 changed files with 1,565 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
FROM continuumio/miniconda3 | ||
COPY requirements.txt . | ||
RUN conda install -c conda-forge psutil setproctitle | ||
RUN pip install -r requirements.txt | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
# modin-tutorial | ||
Tutorial for how to use different features of Modin |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
fsspec | ||
s3fs | ||
ray==1.0.0 | ||
jupyterlab | ||
git+https://github.com/modin-project/modin |
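Of the requirements above, only `ray` is pinned to an exact version (`1.0.0`); Modin itself is installed from the head of the repository. If one wanted to confirm from inside the tutorial environment that the pinned version is the one actually installed, a check along the following lines might help. This is a sketch, not part of the commit: `find_pin_mismatches` is a hypothetical helper name, and it assumes Python 3.8+ for `importlib.metadata`.

```python
from importlib import metadata
from typing import Callable, Dict, List


def find_pin_mismatches(
    pins: Dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> List[str]:
    """Return one message per package whose installed version differs from its pin.

    Hypothetical helper for illustration; `get_version` is injectable so the
    logic can be exercised without the packages being installed.
    """
    problems = []
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (wanted {wanted})")
            continue
        if installed != wanted:
            problems.append(f"{name}: installed {installed}, wanted {wanted}")
    return problems


# Only `ray` carries an exact pin in the requirements file shown above.
print(find_pin_mismatches({"ray": "1.0.0"}))
```

An empty list means every pinned package matches; anything else lists what differs.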
146 changes: 146 additions & 0 deletions
examples/tutorial/tutorial_notebooks/cluster/exercise_4.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,146 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"![LOGO](../img/MODIN_ver2_hrz.png)\n", | ||
"\n", | ||
"<center><h2>Scale your pandas workflows by changing one line of code</h2></center>\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Exercise 4: Setting up a cluster environment\n", | ||
"\n", | ||
"**GOAL**: Learn how to set up a cluster for Modin.\n", | ||
"\n", | ||
"**NOTE**: This exercise has extra requirements. Read the instructions carefully before attempting.\n", | ||
"\n", | ||
"**This exercise starts a 700+ core cluster on AWS that is not shut down until the end of Exercise 5, so it incurs cost for as long as it runs. Read the instructions carefully.**" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In practice, workloads often exceed the capabilities of a single machine. Modin works and performs well both in local mode and in a cluster environment. The key advantage of Modin is that your notebook does not change between local development and cluster execution. You do not need to think about how many workers exist or how to distribute and partition your data; Modin handles this for you.\n", | ||
"\n", | ||
"![Cluster](../img/modin_cluster.png)\n", | ||
"\n", | ||
"**Extra Requirements for this exercise**\n", | ||
"\n", | ||
"Detailed instructions can be found here: https://docs.ray.io/en/master/cluster/launcher.html\n", | ||
"\n", | ||
"From the command line:\n", | ||
"- `pip install boto3`\n", | ||
"- `aws configure`\n", | ||
"- `ray up modin-cluster.yaml`\n", | ||
"\n", | ||
"Included in this directory is a file named `modin-cluster.yaml`. We will use this to start the cluster." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# !pip install boto3" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# !aws configure" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Starting and connecting to the cluster\n", | ||
"\n", | ||
"This example starts 1 head node (m5.24xlarge) and 7 workers (m5.24xlarge), for a total of 768 CPUs (8 nodes with 96 vCPUs each).\n", | ||
"\n", | ||
"The cost of this cluster can be found here: https://aws.amazon.com/ec2/pricing/on-demand/." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# !ray up modin-cluster.yaml" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Connect to the cluster with `ray attach`" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# !ray attach modin-cluster.yaml" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# DO NOT CHANGE THIS CODE!\n", | ||
"# Changing this code risks breaking further exercises\n", | ||
"\n", | ||
"import time\n", | ||
"time.sleep(600) # We need to give ray enough time to start up all the workers\n", | ||
"import ray\n", | ||
"ray.init(address=\"auto\")\n", | ||
"import modin.pandas as pd\n", | ||
"assert pd.DEFAULT_NPARTITIONS == 768, \"Not all Ray nodes are started up yet\"\n", | ||
"ray.shutdown()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Please move on to Exercise 5" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.6.8" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
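The last code cell of Exercise 4 waits a fixed `time.sleep(600)` before checking that all workers are up, regardless of how quickly they actually start. One possible alternative, sketched here as an assumption rather than as part of the commit, is to poll the cluster's CPU count until it reaches the expected 768 or a timeout expires. `wait_for_cpus` and all of its parameters are hypothetical names; the CPU-count source is passed in as a callable so the loop can be run without a cluster.

```python
import time
from typing import Callable


def wait_for_cpus(
    get_cpu_count: Callable[[], float],
    expected: int = 768,
    timeout_s: float = 600.0,
    poll_interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the cluster reports at least `expected` CPUs.

    Returns True if the target was reached before `timeout_s` elapsed,
    False otherwise. Hypothetical helper, not part of Modin or Ray.
    """
    deadline = clock() + timeout_s
    while True:
        if get_cpu_count() >= expected:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

On the head node, after `ray.init(address="auto")`, the callable could be `lambda: ray.cluster_resources().get("CPU", 0)`, since `ray.cluster_resources()` returns a dictionary of the cluster's total resources. The upper bound stays at the same ten minutes the notebook already allows.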
184 changes: 184 additions & 0 deletions
examples/tutorial/tutorial_notebooks/cluster/exercise_5.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,184 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"![LOGO](../img/MODIN_ver2_hrz.png)\n", | ||
"\n", | ||
"<center><h2>Scale your pandas workflows by changing one line of code</h2></center>\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Exercise 5: Executing on a cluster environment\n", | ||
"\n", | ||
"**GOAL**: Learn how to connect Modin to a Ray cluster and run pandas queries on a cluster.\n", | ||
"\n", | ||
"**NOTE**: Exercise 4 must be completed first; this exercise relies on the cluster created in Exercise 4." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Modin performance scales as the number of nodes and cores increases. In this exercise, we will reproduce the data from the plot below.\n", | ||
"\n", | ||
"![ClusterPerf](../img/modin_cluster_perf.png)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Don't change this cell!\n", | ||
"import ray\n", | ||
"ray.init(address=\"auto\")\n", | ||
"import modin.pandas as pd\n", | ||
"if pd.DEFAULT_NPARTITIONS != 768:\n", | ||
"    print(\"This notebook was designed and tested for an 8 node Ray cluster. \"\n", | ||
"          \"Proceed at your own risk!\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!du -h big_yellow.csv" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%time\n", | ||
"df = pd.read_csv(\"big_yellow.csv\", quoting=3)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%time\n", | ||
"count_result = df.count()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# print\n", | ||
"count_result" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%time\n", | ||
"groupby_result = df.groupby(\"passenger_count\").count()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# print\n", | ||
"groupby_result" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%time\n", | ||
"apply_result = df.applymap(str)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# print\n", | ||
"apply_result" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ray.shutdown()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Shutting down the cluster\n", | ||
"\n", | ||
"**You may have to change the path below**. If this does not work, log in to your \n", | ||
"\n", | ||
"Now that we have finished computation, we can shut down the cluster with `ray down`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!ray down modin-cluster.yaml" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### This ends the cluster exercise" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.6.8" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
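The `%%time` cell magic used throughout Exercise 5 is only available inside IPython or Jupyter. If the same measurements were to be taken from a plain Python script, a small timing wrapper along these lines could stand in for it. This is a sketch under that assumption: `timed` is a hypothetical helper name, and the summation below is only a stand-in workload for calls such as `df.count()`.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run `fn` once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the notebook this could be e.g. `lambda: df.count()`.
total, seconds = timed(lambda: sum(range(1_000_000)))
print(f"sum finished in {seconds:.4f} s")
```

Wrapping each query in a zero-argument callable keeps the timing code separate from the query itself, so the same helper could be reused for the `count`, `groupby`, and `applymap` steps.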