Restored code from extras sections
davewhipp committed Oct 2, 2019
1 parent ec0b52f commit 397aa43
Showing 2 changed files with 83 additions and 29 deletions.
47 changes: 34 additions & 13 deletions notebooks/L5/exploring-data-using-pandas.ipynb
Expand Up @@ -20,7 +20,7 @@
"\n",
"These Pandas structures incorporate a number of things we've already encountered, such as indices, data stored in a collection, and data types. Let's have another look at the Pandas data structures below with some additional annotation.\n",
"\n",
"![Pandas data structures](img/pandas-structures-annotated.png)\n",
"![Pandas data structures annotated](img/pandas-structures-annotated.png)\n",
"\n",
"As you can see, both DataFrames and Series in pandas have an index that can be used to select values, but they also have column labels to identify columns in DataFrames. In the lesson this week we'll use many of these features to explore real-world data and learn some useful data analysis procedures.\n",
"\n",
Expand Down Expand Up @@ -314,7 +314,7 @@
"\n",
"**Note**\n",
"\n",
" We can use [IPython magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html#line-magics) to figure out what variables we have in memory. IPython magic command `%who` will display names of those variables that you have defined during this session. Magic command `%whose` prints out more information about these variables.\n",
" We can use [IPython magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html#line-magics) to figure out what variables we have in memory. IPython magic command `%who` will display names of those variables that you have defined during this session. Magic command `%whos` prints out more information about these variables.\n",
" \n",
" \n",
"</div>"
Expand Down Expand Up @@ -342,7 +342,8 @@
"metadata": {},
"outputs": [],
"source": [
"# Display variable name, type and info\n"
"# Display variable name, type and info\n",
"%whos"
]
},
{
Expand Down Expand Up @@ -672,7 +673,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We could, for example, check the mean temperature in our inpu data. We check the mean for a single column (*Series*): "
"We could, for example, check the mean temperature in our input data. We check the mean for a single column (*Series*): "
]
},
{
Expand Down Expand Up @@ -759,7 +760,9 @@
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
Expand All @@ -773,7 +776,9 @@
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"source": [
"data[[\"TEMP\", \"MAX\", \"MIN\"]].plot()"
]
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -814,7 +819,9 @@
},
"outputs": [],
"source": [
"# Create Pandas Series from a list\n"
"# Create Pandas Series from a list\n",
"number_series = pd.Series([ 4, 5, 6, 7.0])\n",
"print(number_series)"
]
},
{
Expand All @@ -839,14 +846,19 @@
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"source": [
"number_series = pd.Series([ 4, 5, 6, 7.0], index=['a','b','c','d'])\n",
"print(number_series)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"source": [
"type(number_series)"
]
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -883,14 +895,19 @@
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"source": [
"new_data = pd.DataFrame(data = {\"station_name\" : stations, \"lat\" : lats, \"lon\" : lons})\n",
"new_data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"source": [
"type(new_data)"
]
},
{
"cell_type": "markdown",
Expand All @@ -904,14 +921,18 @@
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"source": [
"df = pd.DataFrame()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"source": [
"print(df)"
]
},
{
"cell_type": "markdown",
Expand Down
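The cells restored in this file build a Series and a DataFrame from plain Python objects. The following is a minimal, self-contained sketch of the same steps. The `stations`, `lats` and `lons` values are hypothetical placeholders, since the diff does not show how the notebook defines them, and the name `empty_df` is used here only for illustration.

```python
import pandas as pd

# A Series from a list with custom index labels. Because one value is a
# float (7.0), pandas stores the whole Series as float64.
number_series = pd.Series([4, 5, 6, 7.0], index=["a", "b", "c", "d"])
print(number_series)

# Hypothetical example values (assumption: not taken from the notebook)
stations = ["Station A", "Station B", "Station C"]
lats = [60.0, 61.0, 62.0]
lons = [24.0, 25.0, 26.0]

# A DataFrame from a dictionary: keys become column labels
new_data = pd.DataFrame(
    data={"station_name": stations, "lat": lats, "lon": lons}
)
print(new_data)

# An empty DataFrame has no rows and no columns
empty_df = pd.DataFrame()
print(empty_df.empty)
```

Passing a dictionary keeps each column label next to its values, which is usually easier to read than building the columns one at a time.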
65 changes: 49 additions & 16 deletions notebooks/L5/processing-data-with-pandas.ipynb
Expand Up @@ -245,7 +245,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Add column \"TEMP_KELVIN\"\n"
"# Add column \"TEMP_KELVIN\" "
]
},
{
Expand All @@ -262,7 +262,8 @@
"\n",
"**Selecting several rows:**\n",
"\n",
"One common way of selecting only specific rows from your DataFrame is done via **index slicing** to extract part of the DataFrame.\n",
"One common way of selecting only specific rows from your DataFrame is done via **index slicing** to extract part of the DataFrame. Slicing in pandas can be done in a similar manner as with normal Python lists, i.e. you specify the index range you want to select inside the square brackets ``selection = dataframe[start_index:stop_index]``.\n",
"\n",
"Let's select the first five rows and assign them to a variable called `selection`:"
]
},
Expand All @@ -279,7 +280,7 @@
},
"outputs": [],
"source": [
"# Select first five rows of dataframe\n",
"# Select first five rows of dataframe using index values\n",
"\n",
"\n"
]
Expand All @@ -291,8 +292,7 @@
"editable": true
},
"source": [
"As you can see, slicing can be done in a similar manner as with normal Python lists, i.e. you specify index range you want to select inside the square brackets\n",
"``selection = dataframe[start_index:stop_index]``.\n"
"**Note:** here we selected the first five rows (index 0-4) using the integer index. \n"
]
},
{
Expand All @@ -305,7 +305,9 @@
"**Selecting several rows and columns:**\n",
"\n",
"\n",
"It is also possible to control which columns are chosen, while selecting a subset of rows. Here, we select only temperature values (`TEMP`) between on rows index 0-5:\n"
"It is also possible to control which columns are chosen when selecting a subset of rows. In this case we will use [pandas.DataFrame.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) which selects data based on axis labels (row labels and column labels). \n",
"\n",
"Let's select temperature values (column `TEMP`) on rows 0-5:\n"
]
},
{
Expand All @@ -321,19 +323,26 @@
},
"outputs": [],
"source": [
"# Select temp column values between indices 5 and 10\n",
"# Select temp column values on rows 0-5\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** in this case, we get six rows of data (index 0-5)! We are now doing the selection based on axis labels instead of the integer index."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"It is also possible to select multiple columns using those same indices. Here, we select `TEMP` and the `TEMP_CELSIUS` columns from a set of rows by passing them inside a list (`.loc[start_index:stop_index, list_of_columns]`):"
"It is also possible to select multiple columns when using `loc`. Here, we select `TEMP` and the `TEMP_CELSIUS` columns from a set of rows by passing them inside a list (`.loc[start_index:stop_index, list_of_columns]`):"
]
},
{
Expand All @@ -349,7 +358,7 @@
},
"outputs": [],
"source": [
"# Select temp and temp_celsius column values between indices 5 and 10\n",
"# Select columns temp and temp_celsius on rows 0-5\n",
"\n",
"\n"
]
Expand Down Expand Up @@ -466,7 +475,7 @@
"`.loc` and `.at` are based on the *axis labels* - the names of columns and rows. \n",
"`.iloc` is another indexing operator which is based on *integer values*. \n",
" \n",
"See pandas documentation for more information about [indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-and-selecting-data)\n",
"See pandas documentation for more information about [indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-and-selecting-data).\n",
" \n",
"</div>"
]
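The notebook text in this commit contrasts plain slicing, `.loc` and `.iloc`. The sketch below shows how the three differ in the number of rows returned. The small `data` table is a hypothetical stand-in for the lesson's weather data, and the Fahrenheit-to-Celsius conversion is an assumption about how `TEMP_CELSIUS` was derived.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's data (values are made up)
data = pd.DataFrame(
    {
        "TEMP": [65.5, 65.8, 68.4, 69.4, 70.5, 72.0, 71.1],
        "MAX": [73.6, 74.8, 77.9, 77.1, 80.2, 81.0, 79.5],
    }
)
# Assumed conversion from Fahrenheit to Celsius
data["TEMP_CELSIUS"] = (data["TEMP"] - 32) / 1.8

# Plain slicing is position-based and excludes the stop value: 5 rows
selection = data[0:5]

# .loc is label-based and includes the stop label: 6 rows
temps = data.loc[0:5, "TEMP"]
temps_both = data.loc[0:5, ["TEMP", "TEMP_CELSIUS"]]

# .iloc is position-based: 5 rows and the first 2 columns
by_position = data.iloc[0:5, 0:2]

# A single value and the last row, both by integer position
single_value = data.iloc[0, 1]
last_row = data.iloc[-1]

print(len(selection), len(temps), by_position.shape)
```

With the default integer index, labels and positions look identical, so the inclusive stop in `.loc` is the main thing to watch for.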
Expand All @@ -483,7 +492,9 @@
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"source": [
"data.iloc[0:5:,0:2]"
]
},
{
"cell_type": "markdown",
Expand All @@ -498,7 +509,25 @@
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"source": [
"data.iloc[0,1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also access individual rows using `iloc`. Let's check out the last row of data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.iloc[-1]"
]
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -656,7 +685,7 @@
"source": [
"As you can see by looking at the table above (and the change in index values), we now have a DataFrame without the NoData values.\n",
"\n",
"Another option is to fill the NoData with some value using the `fillna()` function. Here we can fill the missing values in the with value 0. Note that we are not giving the `subset` parameter this time."
"Another option is to fill the NoData with some value using the `fillna()` function. Here we can fill the missing values with the value -9999. Note that we are not giving the `subset` parameter this time."
]
},
{
Expand All @@ -672,7 +701,7 @@
},
"outputs": [],
"source": [
"# Fill na values with 0\n"
"# Fill na values\n"
]
},
{
Expand All @@ -682,7 +711,7 @@
"editable": true
},
"source": [
"As a result we now have a DataFrame where NoData values are filled with the value 0.0."
"As a result we now have a DataFrame where NoData values are filled with the value -9999."
]
},
{
Expand All @@ -694,7 +723,11 @@
"source": [
"<div class=\"alert alert-warning\">\n",
"\n",
"**Warning:** In many cases filling the data with a specific value is dangerous because you end up modifying the actual data, which might affect the results of your analysis. For example, in the case above we would have dramatically changed the temperature difference columns because the 0 values not an actual temperature difference! Hence, use caution when filling missing values.\n",
"**Warning:** \n",
" \n",
"In many cases filling the data with a specific value is dangerous because you end up modifying the actual data, which might affect the results of your analysis. For example, in the case above we would have dramatically changed the temperature difference columns because the -9999 values are not an actual temperature difference! Hence, use caution when filling missing values. \n",
" \n",
"You might have to fill in no data values, for example, when working with GIS data. Always pay attention to potential no data values when reading in data files and doing further analysis!\n",
"\n",
"</div>"
]
Expand Down
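The warning about filling missing values can be illustrated with a short sketch. The table below is hypothetical (made-up values with one missing temperature), and it only assumes the `-9999` fill value named in the notebook text.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing temperature (values are made up)
data = pd.DataFrame(
    {
        "TEMP": [65.5, np.nan, 68.4, 69.4],
        "MAX": [73.6, 74.8, 77.9, 77.1],
    }
)

# Fill every missing value in the DataFrame (no `subset` given)
filled = data.fillna(value=-9999)

# mean() skips NaN by default, but it cannot skip a sentinel value,
# so the filled column gives a badly distorted average
mean_with_nan = data["TEMP"].mean()
mean_filled = filled["TEMP"].mean()

print(mean_with_nan, mean_filled)
```

If a sentinel such as `-9999` is required by a file format, one option is to fill only when writing the file and keep `NaN` in memory for the analysis.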
