diff --git a/examples/tutorial/Dockerfile b/examples/tutorial/Dockerfile new file mode 100644 index 00000000000..b8c3a367803 --- /dev/null +++ b/examples/tutorial/Dockerfile @@ -0,0 +1,6 @@ +FROM continuumio/miniconda3 + +RUN conda install -c conda-forge psutil setproctitle +COPY requirements.txt . +RUN pip install -r requirements.txt + diff --git a/examples/tutorial/README.md b/examples/tutorial/README.md new file mode 100644 index 00000000000..660ad8c0881 --- /dev/null +++ b/examples/tutorial/README.md @@ -0,0 +1,2 @@ +# modin-tutorial +Tutorial on how to use the different features of Modin diff --git a/examples/tutorial/requirements.txt b/examples/tutorial/requirements.txt new file mode 100644 index 00000000000..7a4bc510d5d --- /dev/null +++ b/examples/tutorial/requirements.txt @@ -0,0 +1,5 @@ +fsspec +s3fs +ray==1.0.0 +jupyterlab +git+https://github.com/modin-project/modin diff --git a/examples/tutorial/tutorial_notebooks/cluster/exercise_4.ipynb b/examples/tutorial/tutorial_notebooks/cluster/exercise_4.ipynb new file mode 100644 index 00000000000..341d2a23010 --- /dev/null +++ b/examples/tutorial/tutorial_notebooks/cluster/exercise_4.ipynb @@ -0,0 +1,146 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![LOGO](../img/MODIN_ver2_hrz.png)\n", + "\n", + "

Scale your pandas workflows by changing one line of code

\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Exercise 4: Setting up a cluster environment\n", + "\n", + "**GOAL**: Learn how to set up a cluster for Modin.\n", + "\n", + "**NOTE**: This exercise has extra requirements. Read the instructions carefully before attempting. \n", + "\n", + "**This exercise instructs the user on how to start a 700+ core cluster, which is not shut down until the end of Exercise 5. Read the instructions carefully.**" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Often in practice, we need to exceed the capabilities of a single machine. Modin works and performs well both in local mode and in a cluster environment. The key advantage of Modin is that your notebook does not change between local development and cluster execution. Users are not required to think about how many workers exist or how to distribute and partition their data; Modin handles all of this seamlessly and transparently.\n", + "\n", + "![Cluster](../img/modin_cluster.png)\n", + "\n", + "**Extra Requirements for this exercise**\n", + "\n", + "Detailed instructions can be found here: https://docs.ray.io/en/master/cluster/launcher.html\n", + "\n", + "From the command line:\n", + "- `pip install boto3`\n", + "- `aws configure`\n", + "- `ray up modin-cluster.yaml`\n", + "\n", + "Included in this directory is a file named `modin-cluster.yaml`. We will use this file to start the cluster."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# !pip install boto3" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# !aws configure" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Starting and connecting to the cluster\n", + "\n", + "This example starts 1 head node (m5.24xlarge) and 7 workers (m5.24xlarge), for 768 total CPUs.\n", + "\n", + "The cost of this cluster can be found here: https://aws.amazon.com/ec2/pricing/on-demand/." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# !ray up modin-cluster.yaml" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Connect to the cluster with `ray attach`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# !ray attach modin-cluster.yaml" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# DO NOT CHANGE THIS CODE!\n", + "# Changing this code risks breaking further exercises\n", + "\n", + "import time\n", + "time.sleep(600) # We need to give Ray enough time to start up all the workers\n", + "import ray\n", + "ray.init(address=\"auto\")\n", + "import modin.pandas as pd\n", + "assert pd.DEFAULT_NPARTITIONS == 768, \"Not all Ray nodes are started up yet\"\n", + "ray.shutdown()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Please move on to Exercise 5" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": 
"3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/tutorial/tutorial_notebooks/cluster/exercise_5.ipynb b/examples/tutorial/tutorial_notebooks/cluster/exercise_5.ipynb new file mode 100644 index 00000000000..f4e4e82dcca --- /dev/null +++ b/examples/tutorial/tutorial_notebooks/cluster/exercise_5.ipynb @@ -0,0 +1,184 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![LOGO](../img/MODIN_ver2_hrz.png)\n", + "\n", + "

Scale your pandas workflows by changing one line of code

\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Exercise 5: Executing on a cluster environment\n", + "\n", + "**GOAL**: Learn how to connect Modin to a Ray cluster and run pandas queries on a cluster.\n", + "\n", + "**NOTE**: Exercise 4 must be completed first; this exercise relies on the cluster created in Exercise 4." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Modin performance scales as the number of nodes and cores increases. In this exercise, we will reproduce the data from the plot below.\n", + "\n", + "![ClusterPerf](../img/modin_cluster_perf.png)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Don't change this cell!\n", + "import ray\n", + "ray.init(address=\"auto\")\n", + "import modin.pandas as pd\n", + "if pd.DEFAULT_NPARTITIONS != 768:\n", + " print(\"This notebook was designed and tested for an 8-node Ray cluster. \"\n", + " \"Proceed at your own risk!\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!du -h big_yellow.csv" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "df = pd.read_csv(\"big_yellow.csv\", quoting=3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "count_result = df.count()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# print\n", + "count_result" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "groupby_result = df.groupby(\"passenger_count\").count()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# print\n", + "groupby_result" + ] + }, + { + "cell_type": 
"code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "apply_result = df.applymap(str)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# print\n", + "apply_result" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ray.shutdown()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Shutting down the cluster\n", + "\n", + "**You may have to change the path below**. If this does not work, log in to your AWS console and terminate the instances manually.\n", + "\n", + "Now that we have finished computation, we can shut down the cluster with `ray down`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!ray down modin-cluster.yaml" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### This ends the cluster exercise" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/tutorial/tutorial_notebooks/cluster/modin-cluster.yaml b/examples/tutorial/tutorial_notebooks/cluster/modin-cluster.yaml new file mode 100644 index 00000000000..78b3be39daa --- /dev/null +++ b/examples/tutorial/tutorial_notebooks/cluster/modin-cluster.yaml @@ -0,0 +1,163 @@ +# A unique identifier for the head node and workers of this cluster. +cluster_name: modin_init + +# The minimum number of worker nodes to launch in addition to the head +# node. This number should be >= 0. +min_workers: 7 + +# The maximum number of worker nodes to launch in addition to the head +# node. 
This takes precedence over min_workers. +max_workers: 7 + +# The initial number of worker nodes to launch in addition to the head +# node. When the cluster is first brought up (or when it is refreshed with a +# subsequent `ray up`) this number of nodes will be started. +initial_workers: 7 + +# Whether or not to autoscale aggressively. If this is enabled, if at any point +# we would start more workers, we start at least enough to bring us to +# initial_workers. +autoscaling_mode: default + +# This executes all commands on all nodes in the docker container, +# and opens all the necessary ports to support the Ray cluster. +# Empty string means disabled. +docker: + image: "" # e.g., rayproject/ray:0.8.7 + container_name: "" # e.g. ray_docker + # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image + # if no cached version is present. + pull_before_run: True + run_options: [] # Extra options to pass into "docker run" + + # Example of running a GPU head with CPU workers + # head_image: "rayproject/ray:0.8.7-gpu" + # head_run_options: + # - --runtime=nvidia + + # worker_image: "rayproject/ray:0.8.7" + # worker_run_options: [] + +# The autoscaler will scale up the cluster to this target fraction of resource +# usage. For example, if a cluster of 10 nodes is 100% busy and +# target_utilization is 0.8, it would resize the cluster to 13. This fraction +# can be decreased to increase the aggressiveness of upscaling. +# This max value allowed is 1.0, which is the most conservative setting. +target_utilization_fraction: 0.8 + +# If a node is idle for this many minutes, it will be removed. +idle_timeout_minutes: 5 + +# Cloud-provider specific configuration. +provider: + type: aws + region: us-west-2 + # Availability zone(s), comma-separated, that nodes may be launched in. + # Nodes are currently spread between zones by a round-robin approach, + # however this implementation detail should not be relied upon. 
+ availability_zone: us-west-2a,us-west-2b + # Whether to allow node reuse. If set to False, nodes will be terminated + # instead of stopped. + cache_stopped_nodes: True # If not present, the default is True. + +# How Ray will authenticate with newly launched nodes. +auth: + ssh_user: ubuntu +# By default Ray creates a new private keypair, but you can also use your own. +# If you do so, make sure to also set "KeyName" in the head and worker node +# configurations below. +# ssh_private_key: /path/to/your/key.pem + +# Provider-specific config for the head node, e.g. instance type. By default +# Ray will auto-configure unspecified fields such as SubnetId and KeyName. +# For more documentation on available fields, see: +# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances +head_node: + InstanceType: m5.24xlarge + ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30 + + # You can provision additional disk space with a conf as follows + BlockDeviceMappings: + - DeviceName: /dev/sda1 + Ebs: + VolumeSize: 500 + + # Additional options in the boto docs. + +# Provider-specific config for worker nodes, e.g. instance type. By default +# Ray will auto-configure unspecified fields such as SubnetId and KeyName. +# For more documentation on available fields, see: +# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances +worker_nodes: + InstanceType: m5.24xlarge + ImageId: ami-0a2363a9cff180a64 # Deep Learning AMI (Ubuntu) Version 30 + + BlockDeviceMappings: + - DeviceName: /dev/sda1 + Ebs: + VolumeSize: 500 + # Run workers on spot by default. Comment this out to use on-demand. + # InstanceMarketOptions: + # MarketType: spot + # Additional options can be found in the boto docs, e.g. + # SpotOptions: + # MaxPrice: MAX_HOURLY_PRICE + + # Additional options in the boto docs. + +# Files or directories to copy to the head and worker nodes. 
The format is a +# dictionary from REMOTE_PATH: LOCAL_PATH, e.g. +file_mounts: { +# "/path1/on/remote/machine": "/path1/on/local/machine", +# "/path2/on/remote/machine": "/path2/on/local/machine", +} + +# Files or directories to copy from the head node to the worker nodes. The format is a +# list of paths. The same path on the head node will be copied to the worker node. +# This behavior is a subset of the file_mounts behavior. In the vast majority of cases +# you should just use file_mounts. Only use this if you know what you're doing! +cluster_synced_files: [] + +# Whether changes to directories in file_mounts or cluster_synced_files in the head node +# should sync to the worker node continuously +file_mounts_sync_continuously: False + +# List of commands that will be run before `setup_commands`. If docker is +# enabled, these commands will run outside the container and before docker +# is setup. +initialization_commands: [] + +# List of shell commands to run to set up nodes. +setup_commands: + # Note: if you're developing Ray, you probably want to create an AMI that + # has your Ray repo pre-cloned. Then, you can replace the pip installs + # below with a git checkout (and possibly a recompile). 
+ - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc + - pip install modin + - pip install ray==1.0.0 + - pip install pyarrow==0.16 + - pip install -U fsspec + - wget https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-01.csv + - printf "VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount\n" > big_yellow.csv + - tail -n +2 yellow_tripdata_2015-01.csv{,}{,}{,}{,}{,}{,} >> big_yellow.csv + # Consider uncommenting these if you also want to run apt-get commands during setup + # - sudo pkill -9 apt-get || true + # - sudo pkill -9 dpkg || true + # - sudo dpkg --configure -a + +# Custom commands that will be run on the head node after common setup. +head_setup_commands: + - pip install boto3==1.4.8 # 1.4.8 adds InstanceMarketOptions + +# Custom commands that will be run on worker nodes after common setup. +worker_setup_commands: [] + +# Command to start ray on the head node. You don't need to change this. +head_start_ray_commands: + - ray stop + - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml + +# Command to start ray on worker nodes. You don't need to change this. 
+worker_start_ray_commands: + - ray stop + - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 diff --git a/examples/tutorial/tutorial_notebooks/img/MODIN_ver2_hrz.png b/examples/tutorial/tutorial_notebooks/img/MODIN_ver2_hrz.png new file mode 100644 index 00000000000..6276bd6168c Binary files /dev/null and b/examples/tutorial/tutorial_notebooks/img/MODIN_ver2_hrz.png differ diff --git a/examples/tutorial/tutorial_notebooks/img/convert_to_pandas.png b/examples/tutorial/tutorial_notebooks/img/convert_to_pandas.png new file mode 100644 index 00000000000..1ba62de95ce Binary files /dev/null and b/examples/tutorial/tutorial_notebooks/img/convert_to_pandas.png differ diff --git a/examples/tutorial/tutorial_notebooks/img/modin_cluster.png b/examples/tutorial/tutorial_notebooks/img/modin_cluster.png new file mode 100644 index 00000000000..7bfb190b072 Binary files /dev/null and b/examples/tutorial/tutorial_notebooks/img/modin_cluster.png differ diff --git a/examples/tutorial/tutorial_notebooks/img/modin_cluster_perf.png b/examples/tutorial/tutorial_notebooks/img/modin_cluster_perf.png new file mode 100644 index 00000000000..d35e2411c19 Binary files /dev/null and b/examples/tutorial/tutorial_notebooks/img/modin_cluster_perf.png differ diff --git a/examples/tutorial/tutorial_notebooks/img/modin_multicore.png b/examples/tutorial/tutorial_notebooks/img/modin_multicore.png new file mode 100644 index 00000000000..9dcd0bbfdf2 Binary files /dev/null and b/examples/tutorial/tutorial_notebooks/img/modin_multicore.png differ diff --git a/examples/tutorial/tutorial_notebooks/img/pandas_multicore.png b/examples/tutorial/tutorial_notebooks/img/pandas_multicore.png new file mode 100644 index 00000000000..a56c4279848 Binary files /dev/null and b/examples/tutorial/tutorial_notebooks/img/pandas_multicore.png differ diff --git a/examples/tutorial/tutorial_notebooks/img/read_csv_perf.png b/examples/tutorial/tutorial_notebooks/img/read_csv_perf.png new file mode 
100644 index 00000000000..7e5f7e8ff63 Binary files /dev/null and b/examples/tutorial/tutorial_notebooks/img/read_csv_perf.png differ diff --git a/examples/tutorial/tutorial_notebooks/introduction/exercise_1.ipynb b/examples/tutorial/tutorial_notebooks/introduction/exercise_1.ipynb new file mode 100644 index 00000000000..a2b1de7034b --- /dev/null +++ b/examples/tutorial/tutorial_notebooks/introduction/exercise_1.ipynb @@ -0,0 +1,232 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![LOGO](../img/MODIN_ver2_hrz.png)\n", + "\n", + "

Scale your pandas workflows by changing one line of code

\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Exercise 1: How to use Modin\n", + "\n", + "**GOAL**: Learn how to import Modin to accelerate and scale pandas workflows." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Modin is a drop-in replacement for pandas that distributes the computation \n", + "across all of the cores in your machine or in a cluster.\n", + "In practical terms, this means that you can continue using the same pandas scripts\n", + "as before and expect the behavior and results to be the same. The only thing that needs\n", + "to change is the import statement. Normally, you would change:\n", + "\n", + "```python\n", + "import pandas as pd\n", + "```\n", + "\n", + "to:\n", + "\n", + "```python\n", + "import modin.pandas as pd\n", + "```\n", + "\n", + "Changing this line of code will allow you to use all of the cores in your machine to do computation on your data. One of the major performance bottlenecks of pandas is that it only uses a single core for any given computation. Modin exposes an API that is identical to Pandas, allowing you to continue interacting with your data as you would with Pandas. There are no additional commands required to use Modin locally. Partitioning, scheduling, data transfer, and other related concerns are all handled by Modin under the hood." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "

\n", + "

pandas on a multicore laptop\n", + " \n", + " Modin on a multicore laptop\n", + " \n", + "\n", + "
\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Concept for exercise: Dataframe constructor\n", + "\n", + "Often when playing around in Pandas, it is useful to create a DataFrame with the constructor. That is where we will start.\n", + "\n", + "```python\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "frame_data = np.random.randint(0, 100, size=(2**10, 2**5))\n", + "df = pd.DataFrame(frame_data)\n", + "```\n", + "\n", + "When creating a dataframe from a non-distributed object, it will take extra time to partition the data. When this is happening, you will see this message:\n", + "\n", + "```\n", + "UserWarning: Distributing object. This may take some time.\n", + "```\n", + "\n", + "**In a later exercise, we will discuss times when it is not possible to speed up the computation, even with multiprocessing or multithreading.**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Note: Do not change this code!\n", + "import numpy as np\n", + "import pandas\n", + "import subprocess\n", + "import sys\n", + "import modin" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pandas.__version__" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "modin.__version__" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Implement your answer here. You are also free to play with the size\n", + "# and shape of the DataFrame, but beware of exceeding your memory!\n", + "\n", + "import pandas as pd\n", + "\n", + "frame_data = np.random.randint(0, 100, size=(2**10, 2**5))\n", + "df = pd.DataFrame(frame_data)\n", + "\n", + "# ***** Do not change the code below! It verifies that \n", + "# ***** the exercise has been done correctly. 
*****\n", + "\n", + "try:\n", + " assert df is not None\n", + " assert frame_data is not None\n", + " assert isinstance(frame_data, np.ndarray)\n", + "except:\n", + " raise AssertionError(\"Don't change too much of the original code!\")\n", + "assert \"modin.pandas\" in sys.modules, \"Not quite correct. Remember the single line of code change (See above)\"\n", + "\n", + "import modin.pandas\n", + "assert pd == modin.pandas, \"Remember the single line of code change (See above)\"\n", + "assert hasattr(df, \"_query_compiler\"), \"Make sure that `df` is a modin.pandas DataFrame.\"\n", + "\n", + "print(\"Success! You only need to change one line of code!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we have created a toy example for playing around with the DataFrame, let's print it out in different ways." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Concept for Exercise: Data Interaction and Printing\n", + "\n", + "When interacting with data, it is very important to look at different parts of the data (e.g. `df.head()`). Here we will show that you can print the modin.pandas DataFrame in the same ways you would with pandas."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Print the first 10 lines.\n", + "df.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Print the DataFrame.\n", + "df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Free cell for custom interaction (Play around here!)\n", + "df.add_prefix(\"col\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df.count()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Please move on to Exercise 2 when you are ready**" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/tutorial/tutorial_notebooks/introduction/exercise_2.ipynb b/examples/tutorial/tutorial_notebooks/introduction/exercise_2.ipynb new file mode 100644 index 00000000000..65789ed3880 --- /dev/null +++ b/examples/tutorial/tutorial_notebooks/introduction/exercise_2.ipynb @@ -0,0 +1,525 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![LOGO](../img/MODIN_ver2_hrz.png)\n", + "\n", + "

Scale your pandas workflows by changing one line of code

\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Exercise 2: Speed improvements\n", + "\n", + "**GOAL**: Learn about common functionality that Modin speeds up by using all of your machine's cores." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Concept for Exercise: `read_csv` speedups\n", + "\n", + "The most commonly used data ingestion method in pandas is reading CSV files (link to pandas survey). This concept is designed to give an idea of the kinds of speedups possible, even on a non-distributed filesystem. Modin also supports other file formats for parallel and distributed reads, which can be found in the documentation.\n", + "\n", + "![](../img/read_csv_perf.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will import both Modin and pandas so that the speedups are evident.\n", + "\n", + "**Note: Rerunning the `read_csv` cells many times may result in degraded performance, depending on the memory of the machine**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import modin.pandas as pd\n", + "import pandas as old_pd\n", + "import time\n", + "from IPython.display import Markdown, display\n", + "\n", + "def printmd(string):\n", + " display(Markdown(string))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dataset: 2015 NYC taxi trip data\n", + "\n", + "Link to raw dataset: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page\n", + "\n", + "We will be using a version of this data already in S3, originally posted in this blog post: https://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes\n", + "\n", + "**Size: ~2GB**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s3_path = \"s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {},
+ "source": [ + "## `pandas.read_csv`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "start = time.time()\n", + "\n", + "pandas_df = old_pd.read_csv(s3_path, parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"], quoting=3)\n", + "\n", + "end = time.time()\n", + "pandas_duration = end - start\n", + "print(\"Time to read with pandas: {} seconds\".format(round(pandas_duration, 3)))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Expect pandas to take >3 minutes on EC2, longer locally\n", + "\n", + "This is a good time to chat with your neighbor.\nDiscussion topics:\n- Do you work with a large amount of data daily?\n- How big is your data?\n- What’s the common use case of your data?\n- Do you use any big data analytics tools?\n- Do you use any interactive analytics tool?\n- What are some drawbacks of your current interactive analytics tools today?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## `modin.pandas.read_csv`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "start = time.time()\n", + "\n", + "modin_df = pd.read_csv(s3_path, parse_dates=[\"tpep_pickup_datetime\", \"tpep_dropoff_datetime\"], quoting=3)\n", + "\n", + "end = time.time()\n", + "modin_duration = end - start\n", + "print(\"Time to read with Modin: {} seconds\".format(round(modin_duration, 3)))\n", + "\n", + "printmd(\"### Modin is {}x faster than pandas at `read_csv`!\".format(round(pandas_duration / modin_duration, 2)))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Are they equal?"
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pandas_df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "modin_df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Concept for exercise: Reductions\n", + "\n", + "In pandas, a reduction would be something along the lines of a `sum` or `count`. It computes some summary statistics about the rows or columns. We will be using `count`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "start = time.time()\n", + "\n", + "pandas_count = pandas_df.count()\n", + "\n", + "end = time.time()\n", + "pandas_duration = end - start\n", + "\n", + "print(\"Time to count with pandas: {} seconds\".format(round(pandas_duration, 3)))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "start = time.time()\n", + "\n", + "modin_count = modin_df.count()\n", + "\n", + "end = time.time()\n", + "modin_duration = end - start\n", + "print(\"Time to count with Modin: {} seconds\".format(round(modin_duration, 3)))\n", + "\n", + "printmd(\"### Modin is {}x faster than pandas at `count`!\".format(round(pandas_duration / modin_duration, 2)))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Are they equal?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pandas_count" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "modin_count" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Concept for exercise: Map operations\n", + "\n", + "In pandas, map operations are operations that do a single pass over the data and do not change its shape. 
Operations like `isnull` and `applymap` are included in this. We will be using `isnull`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "start = time.time()\n", + "\n", + "pandas_isnull = pandas_df.isnull()\n", + "\n", + "end = time.time()\n", + "pandas_duration = end - start\n", + "\n", + "print(\"Time to isnull with pandas: {} seconds\".format(round(pandas_duration, 3)))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "start = time.time()\n", + "\n", + "modin_isnull = modin_df.isnull()\n", + "\n", + "end = time.time()\n", + "modin_duration = end - start\n", + "print(\"Time to isnull with Modin: {} seconds\".format(round(modin_duration, 3)))\n", + "\n", + "printmd(\"### Modin is {}x faster than pandas at `isnull`!\".format(round(pandas_duration / modin_duration, 2)))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Are they equal?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pandas_isnull" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "modin_isnull" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Concept for exercise: Apply over a single column\n", + "\n", + "Sometimes we want to compute some summary statistics on a single column from our dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "start = time.time()\n", + "rounded_trip_distance_pandas = pandas_df[\"trip_distance\"].apply(round)\n", + "\n", + "end = time.time()\n", + "pandas_duration = end - start\n", + "print(\"Time to apply with pandas: {} seconds\".format(round(pandas_duration, 3)))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "start = time.time()\n", + "\n", + "rounded_trip_distance_modin = modin_df[\"trip_distance\"].apply(round)\n", + "\n", + "end = time.time()\n", + "modin_duration = end - start\n", + "print(\"Time to apply with Modin: {} seconds\".format(round(modin_duration, 3)))\n", + "\n", + "printmd(\"### Modin is {}x faster than pandas at `apply` on one column!\".format(round(pandas_duration / modin_duration, 2)))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Are they equal?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "rounded_trip_distance_pandas" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "rounded_trip_distance_modin" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Concept for exercise: Add a column\n", + "\n", + "It is common to need to add a new column to an existing dataframe; here we show that this is significantly faster in Modin due to metadata management and an efficient zero-copy implementation."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "start = time.time()\n", + "pandas_df[\"rounded_trip_distance\"] = rounded_trip_distance_pandas\n", + "\n", + "end = time.time()\n", + "pandas_duration = end - start\n", + "print(\"Time to add a column with pandas: {} seconds\".format(round(pandas_duration, 3)))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "start = time.time()\n", + "\n", + "modin_df[\"rounded_trip_distance\"] = rounded_trip_distance_modin\n", + "\n", + "end = time.time()\n", + "modin_duration = end - start\n", + "print(\"Time to add a column with Modin: {} seconds\".format(round(modin_duration, 3)))\n", + "\n", + "printmd(\"### Modin is {}x faster than pandas at adding a column!\".format(round(pandas_duration / modin_duration, 2)))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Are they equal?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pandas_df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "modin_df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Concept for exercise: Groupby and aggregate\n", + "\n", + "In pandas, you can group by a column and then aggregate over each group. We will group by a column in the dataset and use `count` for our aggregate."
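As a plain-Python mental model of the grouped count, consider the toy example below (our own illustration, not Modin code; note that pandas' `count` reports non-null values per column within each group, whereas this simply counts rows per group key):

```python
from collections import Counter

# Toy illustration of what groupby + count computes: the number of
# rows that fall into each group key.
trip_distances = [1.2, 0.8, 1.7, 1.0, 2.4, 1.1]

# Build group keys the same way the "rounded_trip_distance" column is built.
group_keys = [round(d) for d in trip_distances]

# Counter tallies how many rows share each key.
counts = Counter(group_keys)
print(sorted(counts.items()))  # [(1, 4), (2, 2)]
```

Modin performs the same logical operation, but distributes the tallying across partitions before combining the per-partition results.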
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "start = time.time()\n", + "\n", + "pandas_groupby = pandas_df.groupby(by=\"rounded_trip_distance\").count()\n", + "\n", + "end = time.time()\n", + "pandas_duration = end - start\n", + "\n", + "print(\"Time to groupby with pandas: {} seconds\".format(round(pandas_duration, 3)))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "start = time.time()\n", + "\n", + "modin_groupby = modin_df.groupby(by=\"rounded_trip_distance\").count()\n", + "\n", + "end = time.time()\n", + "modin_duration = end - start\n", + "print(\"Time to groupby with Modin: {} seconds\".format(round(modin_duration, 3)))\n", + "\n", + "printmd(\"### Modin is {}x faster than pandas at `groupby`!\".format(round(pandas_duration / modin_duration, 2)))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Are they equal?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pandas_groupby" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "modin_groupby" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Please move on to tutorial 3" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/tutorial/tutorial_notebooks/introduction/exercise_3.ipynb b/examples/tutorial/tutorial_notebooks/introduction/exercise_3.ipynb new file mode 100644 index 00000000000..6de80d3175a 
--- /dev/null +++ b/examples/tutorial/tutorial_notebooks/introduction/exercise_3.ipynb @@ -0,0 +1,303 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![LOGO](../img/MODIN_ver2_hrz.png)\n", + "\n", + "<center><h2>Scale your pandas workflows by changing one line of code</h2></center>\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Exercise 3: Not Implemented\n", + "\n", + "**GOAL**: Learn what happens when a function is not yet supported in Modin and which functionality cannot be accelerated" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When functionality has not yet been implemented, we default to pandas.\n", + "\n", + "![](../img/convert_to_pandas.png)\n", + "\n", + "We convert the Modin dataframe to pandas, perform the operation, then convert the result back once it is finished. These operations have high overhead due to the communication involved and will take longer than pandas.\n", + "\n", + "When this happens, a warning is shown to inform the user that the operation will take longer than usual. For example, `DataFrame.kurtosis` is not yet implemented. In this case, when a user tries to use it, they will see this warning:\n", + "\n", + "```\n", + "UserWarning: `DataFrame.kurtosis` defaulting to pandas implementation.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Concept for exercise: Default to pandas\n", + "\n", + "In this section of the exercise we will see first-hand how the runtime is affected by operations that are not implemented."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import modin.pandas as pd\n", + "import pandas\n", + "import numpy as np\n", + "import time\n", + "\n", + "frame_data = np.random.randint(0, 100, size=(2**18, 2**8))\n", + "df = pd.DataFrame(frame_data).add_prefix(\"col\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pandas_df = pandas.DataFrame(frame_data).add_prefix(\"col\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "modin_start = time.time()\n", + "\n", + "print(df.kurtosis())\n", + "\n", + "modin_end = time.time()\n", + "print(\"Modin kurtosis took {} seconds.\".format(round(modin_end - modin_start, 4)))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pandas_start = time.time()\n", + "\n", + "print(pandas_df.kurtosis())\n", + "\n", + "pandas_end = time.time()\n", + "print(\"pandas kurtosis took {} seconds.\".format(round(pandas_end - pandas_start, 4)))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Concept for exercise: Register custom functions\n", + "\n", + "Modin's user-facing API is pandas, but it is possible that we do not yet support your favorite or most-needed functionalities. Your user-defined function may also be able to be executed more efficiently if you pre-define the type of function it is (e.g. map, reduction, etc.). To solve either case, it is possible to register a custom function to be applied to your data." 
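Before walking through Modin's real `ReductionFunction.register` below, it may help to see the shape of the register pattern in isolation. The classes here are our own toy illustration (`SimpleReduction` and `ToyQueryCompiler` are hypothetical, not Modin APIs); Modin's real implementation partitions along an axis so reductions like `kurtosis` stay mathematically exact, whereas this sketch is only valid for reductions like `sum` that can be re-reduced over partial results:

```python
class SimpleReduction:
    """Toy stand-in for Modin's ReductionFunction (illustration only)."""

    @classmethod
    def register(cls, reduce_fn):
        # Wrap a function into a method that maps it over every
        # partition, then reduces the partial results once more.
        def caller(query_compiler, *args, **kwargs):
            partials = [reduce_fn(part, *args, **kwargs)
                        for part in query_compiler.partitions]
            return reduce_fn(partials)
        return caller


class ToyQueryCompiler:
    """Hypothetical query compiler holding a list of row partitions."""

    def __init__(self, partitions):
        self.partitions = partitions


# Attach a new "query" the same way the tutorial attaches kurtosis_custom.
ToyQueryCompiler.total = SimpleReduction.register(sum)

qc = ToyQueryCompiler([[1, 2], [3, 4], [5]])
print(qc.total())  # per-partition sums (3, 7, 5), reduced again -> 15
```

The point of the pattern is that `register` returns an ordinary method, so attaching new operations to a query-compiler class is a one-line assignment.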
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Registering a custom function for all query compilers\n", + "\n", + "To register a custom function for a query compiler, we first need to import it:\n", + "\n", + "```python\n", + "from modin.backends.pandas.query_compiler import PandasQueryCompiler\n", + "```\n", + "\n", + "The `PandasQueryCompiler` is responsible for defining and compiling the queries that Modin can operate on, and is specific to the pandas backend. Any queries defined here must both operate on and result in a `pandas.DataFrame`. Many functions have very simple implementations, as you can see in the current code: [Link](https://github.com/modin-project/modin/blob/f15fb8ea776ed039893130b1e85053e875912d4b/modin/backends/pandas/query_compiler.py#L365).\n", + "\n", + "If we want to register a new function, we first need to understand what kind of function it is. In our example, we will use `kurtosis`, which is a reduction.
So we next want to import the function type so we can use it in our definition:\n", + "\n", + "```python\n", + "from modin.data_management.functions import ReductionFunction\n", + "```\n", + "\n", + "Then we can just use the `ReductionFunction.register` `classmethod` and assign it to the `PandasQueryCompiler`:\n", + "\n", + "```python\n", + "PandasQueryCompiler.kurtosis_custom = ReductionFunction.register(pandas.DataFrame.kurtosis)\n", + "```\n", + "\n", + "Finally, we want a handle to it from the `DataFrame`, so we need to create a way to do that:\n", + "\n", + "```python\n", + "def kurtosis_func(self, **kwargs):\n", + " # The constructor allows you to pass in a query compiler as a keyword argument\n", + " return self.__constructor__(query_compiler=self._query_compiler.kurtosis_custom(**kwargs))\n", + "\n", + "pd.DataFrame.kurtosis_custom = kurtosis_func\n", + "```\n", + "\n", + "And then you can use it like you usually would:\n", + "\n", + "```python\n", + "df.kurtosis_custom()\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from modin.backends.pandas.query_compiler import PandasQueryCompiler\n", + "from modin.data_management.functions import ReductionFunction" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "PandasQueryCompiler.kurtosis_custom = ReductionFunction.register(pandas.DataFrame.kurtosis)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# The function signature came from the pandas documentation:\n", + "# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.kurtosis.html\n", + "def kurtosis_func(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):\n", + " # We need to specify the axis for the backend\n", + " if axis is None:\n", + " axis = 0\n", + " # The constructor allows you to pass in a query compiler as a
keyword argument\n", + " # Reduce dimension is used for reductions\n", + " # We also pass all keyword arguments here to ensure correctness\n", + " return self._reduce_dimension(\n", + " self._query_compiler.kurtosis_custom(\n", + " axis=axis, skipna=skipna, level=level, numeric_only=numeric_only, **kwargs\n", + " )\n", + " )\n", + "\n", + "pd.DataFrame.kurtosis_custom = kurtosis_func" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "start = time.time()\n", + "\n", + "print(df.kurtosis())\n", + "\n", + "end = time.time()\n", + "print(\"Modin kurtosis took {} seconds.\".format(end - start))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "start = time.time()\n", + "\n", + "print(df.kurtosis_custom())\n", + "\n", + "end = time.time()\n", + "print(\"Modin kurtosis_custom took {} seconds.\".format(end - start))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Congratulations! You have just implemented `DataFrame.kurtosis`!\n", + "\n", + "## Consider opening a pull request: https://github.com/modin-project/modin/pulls\n", + "\n", + "For a complete list of what is implemented, see the [documentation](https://modin.readthedocs.io/en/latest/UsingPandasonRay/dataframe_supported.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Test your knowledge: Add a custom function for another reduction: `DataFrame.mad`\n", + "\n", + "See the pandas documentation for the correct signature: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mad.html" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "modin_mad_start = time.time()\n", + "\n", + "# Implement your function here! 
Put the result of your custom `mad` in the variable `modin_mad`\n", + "# Hint: Look at the kurtosis walkthrough above\n", + "modin_mad = ...\n", + "\n", + "modin_mad_end = time.time()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Evaluation code, do not change!\n", + "pandas_mad_start = time.time()\n", + "pandas_mad = pandas_df.mad()\n", + "pandas_mad_end = time.time()\n", + "\n", + "assert isinstance(modin_mad, pd.Series), \"This is not a distributed Modin object, try again\"\n", + "assert pandas_mad_end - pandas_mad_start > modin_mad_end - modin_mad_start, \\\n", + " \"Your implementation was too slow, or you used the defaulting to pandas approach. Try again\"\n", + "assert modin_mad._to_pandas().equals(pandas_mad), \"Your result did not match the result of pandas, try again\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Now that you are able to create custom functions, you know enough to contribute to Modin!" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}