initial commit

vaexio · Dec 12, 2018 · aab2c5d
1 parent 5dba691
Showing 6 changed files with 319 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -0,0 +1,5 @@
+# Try out vaex in a Jupyter notebook with a single click on mybinder
+
+
+* Notebooks for the [Medium article: Out of Core Dataframes for Python](https://medium.com/p/12c102db044a/edit)
+   * [Play with the snippets from the article ![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/vaexio/vaex-mybinder/master?filepath=medium-out-of-core%2Farticle_snippets.ipynb)
diff --git a/binder/apt.txt b/binder/apt.txt
@@ -0,0 +1 @@
+wget
diff --git a/binder/postBuild b/binder/postBuild
@@ -0,0 +1,11 @@
+#!/bin/bash
+
+set -ex
+
+# keep git happy
+git config --global user.email "[email protected]"
+git config --global user.name "Bin Der"
+
+
+# jupyter labextension install @jupyter-widgets/jupyterlab-manager jupyter-threejs ipyvolume bqplot --no-build
+# jupyter lab build
diff --git a/binder/requirements.txt b/binder/requirements.txt
@@ -0,0 +1,8 @@
+vaex-core>=0.5.1
+vaex-hdf5
+vaex-arrow
+vaex-jupyter
+vaex-viz
+numba
+scipy
+notebook>=5.4
diff --git a/medium-out-of-core/article_snippets.ipynb b/medium-out-of-core/article_snippets.ipynb
@@ -0,0 +1,294 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import vaex\n",
+    "import numpy as np\n",
+    "%matplotlib inline"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# load the example dataset\n",
+    "# df = vaex.example()\n",
+    "\n",
+    "# or download a slightly larger version of the example dataset\n",
+    "df = vaex.datasets.helmi_de_zeeuw.fetch()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Expressions\n",
+    "Expressions are evaluated lazily, only when vaex needs the result, which saves memory."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "np.sqrt(df.x**2 + df.y**2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Virtual columns\n",
+    "An expression can be added to a DataFrame to create a virtual column. A virtual column can be treated the same as a normal column, except that it does not use up RAM."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df['r'] = np.sqrt(df.x**2 + df.y**2)\n",
+    "df[['x', 'y', 'r']]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.r.mean()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# JIT (just-in-time) compilation\n",
+    "If an expression becomes too slow, try optimizing it with numba or Pythran."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df['r_normal'] = np.sqrt(df.x**2 + df.y**2)\n",
+    "df['r_jit'] = np.sqrt(df.x**2 + df.y**2).jit_numba()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%timeit -n3 -r10\n",
+    "df.mean(df.r_normal)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%timeit -n3 -r10\n",
+    "df.mean(df.r_jit)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Materialize\n",
+    "Or, if you have plenty of RAM, materialize the column."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_m = df.materialize('r')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%timeit -n3 -r10\n",
+    "df_m.mean(df_m.r)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Filtering\n",
+    "Filtering makes no copy of the data, ideal when exploring your 1TB dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_filtered = df[df.x > 0]\n",
+    "df_filtered[['x', 'y', 'r']]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Selections\n",
+    "All statistical functions can take one or more selections as arguments. Multiple selections allow multiple computations in a single pass over the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.mean(df.x, selection=[df.x < 0, df.x > 0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Data cleansing\n",
+    "Even fillna does not use extra memory, so you can try different values without wasting time or RAM."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_fillna_0 = df.fillna(value=0, column_names=['x'])\n",
+    "df_fillna_3 = df.fillna(value=3, column_names=['x'])\n",
+    "df_fillna_5 = df.fillna(value=5, column_names=['x'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# N-d statistics\n",
+    "All statistical methods can be computed on N-dimensional regular grids."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.mean(df.x, binby=df.y, limits=[-10, 10], shape=20)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Visualization\n",
+    "The N-d statistics are the basis for many of the built-in visualizations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.plot1d(df.x, limits=[-10, 10]);"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.plot(df.x, df.y, limits=[-10, 10]);"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Interactive viz\n",
+    "Based on ipywidgets / bqplot, you can even do interactive visualization.\n",
+    "\n",
+    "*Note that (since we are on mybinder) we only use 100,000 rows, instead of 150,000,000 or >1,000,000,000 rows. Download it from https://docs.vaex.io/en/latest/datasets.html if you want to try it out on your local computer.*"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# the first 100,000 rows \n",
+    "df_taxi = vaex.open('./nyc_taxi_2015_100k.arrow')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_taxi.plot_widget(df_taxi.dropoff_longitude, df_taxi.dropoff_latitude, shape=400,\n",
+    "                    f='log1p', controls_selection=True)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/medium-out-of-core/nyc_taxi_2015_100k.arrow b/medium-out-of-core/nyc_taxi_2015_100k.arrow
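
Note on the notebook's "N-d statistics" cell: the call `df.mean(df.x, binby=df.y, limits=[-10, 10], shape=20)` computes a mean of `x` per bin of `y` on a regular grid. The following is a minimal plain-Python sketch of that idea, not vaex's implementation. The helper `binned_mean` is hypothetical, and two behaviours are assumptions of this sketch: rows outside the limits are ignored, and empty bins are reported as NaN.

```python
import math


def binned_mean(values, bin_by, limits, shape):
    """Mean of `values` in `shape` equal-width bins of `bin_by` over [lo, hi).

    Hypothetical helper for illustration only; it is not part of vaex.
    """
    lo, hi = limits
    width = (hi - lo) / shape
    sums = [0.0] * shape
    counts = [0] * shape
    # A single pass over the rows accumulates a sum and a count per bin.
    for v, b in zip(values, bin_by):
        if not (lo <= b < hi):
            continue  # assumption: rows outside the limits are ignored
        i = min(int((b - lo) / width), shape - 1)  # guard against rounding
        sums[i] += v
        counts[i] += 1
    # Assumption: bins that received no rows are reported as NaN.
    return [s / c if c else math.nan for s, c in zip(sums, counts)]


# Tiny made-up data, only to exercise the helper.
x = [1.0, 3.0, 5.0, 7.0]
y = [-9.5, -9.2, 0.5, 20.0]
result = binned_mean(x, y, limits=(-10, 10), shape=20)
```

Because only per-bin sums and counts are kept, the memory needed depends on the grid shape rather than on the number of rows, which is what makes this kind of statistic a natural fit for out-of-core data.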