initial commit
maartenbreddels committed Dec 12, 2018
1 parent 5dba691 commit aab2c5d
Showing 6 changed files with 319 additions and 0 deletions.
5 changes: 5 additions & 0 deletions README.md
@@ -0,0 +1,5 @@
# Try out vaex in a Jupyter notebook with a single click on mybinder


* Notebooks for the [Medium article: Out of Core Dataframes for Python](https://medium.com/p/12c102db044a/edit)
* [Play with the snippets from the article ![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/vaexio/vaex-mybinder/master?filepath=medium-out-of-core%2Farticle_snippets.ipynb)
1 change: 1 addition & 0 deletions binder/apt.txt
@@ -0,0 +1 @@
wget
11 changes: 11 additions & 0 deletions binder/postBuild
@@ -0,0 +1,11 @@
#!/bin/bash

set -ex

# keep git happy
git config --global user.email "[email protected]"
git config --global user.name "Bin Der"


# jupyter labextension install @jupyter-widgets/jupyterlab-manager jupyter-threejs ipyvolume -bqplot -no-build
# jupyter lab build
8 changes: 8 additions & 0 deletions binder/requirements.txt
@@ -0,0 +1,8 @@
vaex-core>=0.5.1
vaex-hdf5
vaex-arrow
vaex-jupyter
vaex-viz
numba
scipy
notebook>=5.4
294 changes: 294 additions & 0 deletions medium-out-of-core/article_snippets.ipynb
@@ -0,0 +1,294 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import vaex\n",
"import numpy as np\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# load the example dataset\n",
"# df = vaex.example()\n",
"\n",
"# or download a slightly larger version of the example dataset\n",
"df = vaex.datasets.helmi_de_zeeuw.fetch()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Expressions\n",
"Vaex evaluates expressions lazily, only when the result is needed, which saves memory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.sqrt(df.x**2 + df.y**2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Virtual columns\n",
"An expression can be added to a DataFrame to create a virtual column. A virtual column can be treated the same as a normal column, except that it does not use up RAM."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['r'] = np.sqrt(df.x**2 + df.y**2)\n",
"df[['x', 'y', 'r']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.r.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# JIT (Just in time) compilation\n",
"If an expression becomes too slow, try optimizing it with Numba or Pythran."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['r_normal'] = np.sqrt(df.x**2 + df.y**2)\n",
"df['r_jit'] = np.sqrt(df.x**2 + df.y**2).jit_numba()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%timeit -n3 -r10\n",
"df.mean(df.r_normal)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%timeit -n3 -r10\n",
"df.mean(df.r_jit)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Materialize\n",
"Or, if you have plenty of RAM, materialize the column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_m = df.materialize('r')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%timeit -n3 -r10\n",
"df_m.mean(df_m.r)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Filtering\n",
"Filtering makes no copy of the data, ideal when exploring your 1TB dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_filtered = df[df.x > 0]\n",
"df_filtered[['x', 'y', 'r']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Selections\n",
"All statistical functions can take 1 or more selections as arguments. Multiple selections allow for multiple computations in 1 pass over the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.mean(df.x, selection=[df.x < 0, df.x > 0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data cleansing\n",
"Even fillna does not use extra memory, so you can try different values without wasting time or RAM."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_fillna_0 = df.fillna(value=0, column_names=['x'])\n",
"df_fillna_3 = df.fillna(value=3, column_names=['x'])\n",
"df_fillna_5 = df.fillna(value=5, column_names=['x'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# N-d statistics\n",
"All statistical methods can be computed on N-dimensional regular grids."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.mean(df.x, binby=df.y, limits=[-10, 10], shape=20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Visualization\n",
"The N-d statistics are the basis for many of the built-in visualizations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.plot1d(df.x, limits=[-10, 10]);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.plot(df.x, df.y, limits=[-10, 10]);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Interactive viz\n",
"Based on ipywidgets / bqplot, you can even do interactive visualization.\n",
"\n",
"*Note that (since we are on mybinder) we only use 100,000 rows, instead of 150,000,000 or >1,000,000,000 rows. Download a larger version from https://docs.vaex.io/en/latest/datasets.html if you want to try it out on your local computer.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the first 100,000 rows \n",
"df_taxi = vaex.open('./nyc_taxi_2015_100k.arrow')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_taxi.plot_widget(df_taxi.dropoff_longitude, df_taxi.dropoff_latitude, shape=400,\n",
" f='log1p', controls_selection=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
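
A reading aid for the "N-d statistics" cell in the notebook above: the sketch below shows, in plain Python, roughly what a mean of `x` binned by `y` on a regular 1-d grid computes. It is not vaex's implementation. The helper name `binned_mean`, the half-open bin edges, and the choice to report NaN for empty bins are assumptions made for illustration only.

```python
import math


def binned_mean(xs, ys, limits=(-10.0, 10.0), shape=20):
    """Mean of xs inside each of `shape` equal-width bins of ys.

    Hypothetical helper for illustration; not part of vaex.
    """
    lo, hi = limits
    width = (hi - lo) / shape
    sums = [0.0] * shape
    counts = [0] * shape
    for x, y in zip(xs, ys):
        # Skip rows that contain NaN or whose y falls outside [lo, hi).
        if math.isnan(x) or math.isnan(y) or not (lo <= y < hi):
            continue
        # Clamp the index to guard against floating-point rounding at the edge.
        i = min(int((y - lo) / width), shape - 1)
        sums[i] += x
        counts[i] += 1
    # An empty bin has no defined mean, so this sketch reports NaN for it.
    return [s / c if c else math.nan for s, c in zip(sums, counts)]


# Tiny made-up sample: two bins over y in [0, 2).
print(binned_mean([1.0, 3.0, 10.0], [0.2, 0.8, 1.5], limits=(0.0, 2.0), shape=2))
# prints [2.0, 10.0]
```

The single loop over the rows mirrors the idea stated in the notebook that such statistics are gathered in one pass over the data; only the running sums and counts per bin are kept in memory.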
Binary file added medium-out-of-core/nyc_taxi_2015_100k.arrow
Binary file not shown.
