diff --git a/notebooks/L5/exploring-data-using-pandas.ipynb b/notebooks/L5/exploring-data-using-pandas.ipynb index ff60b1f..33c7969 100644 --- a/notebooks/L5/exploring-data-using-pandas.ipynb +++ b/notebooks/L5/exploring-data-using-pandas.ipynb @@ -20,7 +20,7 @@ "\n", "These Pandas structures incorporate a number of things we've already encountered, such as indices, data stored in a collection, and data types. Let's have another look at the Pandas data structures below with some additional annotation.\n", "\n", - "![Pandas data structures](img/pandas-structures-annotated.png)\n", + "![Pandas data structures annotated](img/pandas-structures-annotated.png)\n", "\n", "As you can see, both DataFrames and Series in pandas have an index that can be used to select values, but they also have column labels to identify columns in DataFrames. In the lesson this week we'll use many of these features to explore real-world data and learn some useful data analysis procedures.\n", "\n", @@ -314,7 +314,7 @@ "\n", "**Note**\n", "\n", - " We can use [IPython magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html#line-magics) to figure out what variables we have in memory. IPython magic command `%who` will display names of those variables that you have defined during this session. Magic command `%whose` prints out more information about these variables.\n", + " We can use [IPython magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html#line-magics) to figure out what variables we have in memory. IPython magic command `%who` will display names of those variables that you have defined during this session. 
Magic command `%whos` prints out more information about these variables.\n", " \n", " \n", "" @@ -342,7 +342,8 @@ "metadata": {}, "outputs": [], "source": [ - "# Display variable name, type and info\n" + "# Display variable name, type and info\n", + "%whos" ] }, { @@ -672,7 +673,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We could, for example, check the mean temperature in our inpu data. We check the mean for a single column (*Series*): " + "We could, for example, check the mean temperature in our input data. We check the mean for a single column (*Series*): " ] }, { @@ -759,7 +760,9 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "%matplotlib inline" + ] }, { "cell_type": "markdown", @@ -773,7 +776,9 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "data[[\"TEMP\", \"MAX\", \"MIN\"]].plot()" + ] }, { "cell_type": "markdown", @@ -814,7 +819,9 @@ }, "outputs": [], "source": [ - "# Create Pandas Series from a list\n" + "# Create Pandas Series from a list\n", + "number_series = pd.Series([ 4, 5, 6, 7.0])\n", + "print(number_series)" ] }, { @@ -839,14 +846,19 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "number_series = pd.Series([ 4, 5, 6, 7.0], index=['a','b','c','d'])\n", + "print(number_series)" + ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "type(number_series)" + ] }, { "cell_type": "markdown", @@ -883,14 +895,19 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "new_data = pd.DataFrame(data = {\"station_name\" : stations, \"lat\" : lats, \"lon\" : lons})\n", + "new_data" + ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "type(new_data)" + ] }, { "cell_type": "markdown", @@ -904,14 +921,18 @@ "execution_count": null, "metadata": {}, "outputs": 
[], - "source": [] + "source": [ + "df = pd.DataFrame()" + ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "print(df)" + ] }, { "cell_type": "markdown", diff --git a/notebooks/L5/processing-data-with-pandas.ipynb b/notebooks/L5/processing-data-with-pandas.ipynb index d164a40..5f8e823 100644 --- a/notebooks/L5/processing-data-with-pandas.ipynb +++ b/notebooks/L5/processing-data-with-pandas.ipynb @@ -245,7 +245,7 @@ "metadata": {}, "outputs": [], "source": [ - "# Add column \"TEMP_KELVIN\"\n" + "# Add column \"TEMP_KELVIN\" " ] }, { @@ -262,7 +262,8 @@ "\n", "**Selecting several rows:**\n", "\n", - "One common way of selecting only specific rows from your DataFrame is done via **index slicing** to extract part of the DataFrame.\n", + "One common way of selecting only specific rows from your DataFrame is done via **index slicing** to extract part of the DataFrame. Slicing in pandas can be done in a similar manner as with normal Python lists, i.e. you specify the index range you want to select inside the square brackets ``selection = dataframe[start_index:stop_index]``.\n", + "\n", "Let's select the first five rows and assign them to a variable called `selection`:" ] }, { @@ -279,7 +280,7 @@ }, "outputs": [], "source": [ - "# Select first five rows of dataframe\n", + "# Select first five rows of dataframe using index values\n", "\n", "\n" ] @@ -291,8 +292,7 @@ "editable": true }, "source": [ - "As you can see, slicing can be done in a similar manner as with normal Python lists, i.e. you specify index range you want to select inside the square brackets\n", - "``selection = dataframe[start_index:stop_index]``.\n" + "**Note:** here we selected the first five rows (index 0-4) using the integer index. \n" ] }, { @@ -305,7 +305,9 @@ "**Selecting several rows and columns:**\n", "\n", "\n", - "It is also possible to control which columns are chosen, while selecting a subset of rows. 
Here, we select only temperature values (`TEMP`) between on rows index 0-5:\n" + "It is also possible to control which columns are chosen when selecting a subset of rows. In this case we will use [pandas.DataFrame.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) which selects data based on axis labels (row labels and column labels). \n", + "\n", + "Let's select temperature values (column `TEMP`) on rows 0-5:\n" ] }, { @@ -321,11 +323,18 @@ }, "outputs": [], "source": [ - "# Select temp column values between indices 5 and 10\n", + "# Select temp column values on rows 0-5\n", "\n", "\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Note:** in this case, we get six rows of data (index 0-5)! We are now doing the selection based on axis labels instead of the integer index." + ] + }, { "cell_type": "markdown", "metadata": { @@ -333,7 +342,7 @@ "editable": true }, "source": [ - "It is also possible to select multiple columns using those same indices. Here, we select `TEMP` and the `TEMP_CELSIUS` columns from a set of rows by passing them inside a list (`.loc[start_index:stop_index, list_of_columns]`):" + "It is also possible to select multiple columns when using `loc`. Here, we select `TEMP` and the `TEMP_CELSIUS` columns from a set of rows by passing them inside a list (`.loc[start_index:stop_index, list_of_columns]`):" ] }, { @@ -349,7 +358,7 @@ }, "outputs": [], "source": [ - "# Select temp and temp_celsius column values between indices 5 and 10\n", + "# Select columns temp and temp_celsius on rows 0-5\n", "\n", "\n" ] @@ -466,7 +475,7 @@ "`.loc` and `.at` are based on the *axis labels* - the names of columns and rows. \n", "`.iloc` is another indexing operator which is based on *integer values*. 
\n", " \n", - "See pandas documentation for more information about [indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-and-selecting-data)\n", + "See pandas documentation for more information about [indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-and-selecting-data).\n", " \n", "" ] @@ -483,7 +492,9 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "data.iloc[0:5, 0:2]" + ] }, { "cell_type": "markdown", @@ -498,7 +509,25 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [] + "source": [ + "data.iloc[0,1]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also access individual rows using `iloc`. Let's check out the last row of data:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data.iloc[-1]" + ] }, { "cell_type": "markdown", @@ -656,7 +685,7 @@ "source": [ "As you can see by looking at the table above (and the change in index values), we now have a DataFrame without the NoData values.\n", "\n", - "Another option is to fill the NoData with some value using the `fillna()` function. Here we can fill the missing values in the with value 0. Note that we are not giving the `subset` parameter this time." + "Another option is to fill the NoData with some value using the `fillna()` function. Here we can fill the missing values with the value -9999. Note that we are not giving the `subset` parameter this time." ] }, { @@ -672,7 +701,7 @@ }, "outputs": [], "source": [ - "# Fill na values with 0\n" + "# Fill na values\n" ] }, { @@ -682,7 +711,7 @@ "editable": true }, "source": [ - "As a result we now have a DataFrame where NoData values are filled with the value 0.0." + "As a result we now have a DataFrame where NoData values are filled with the value -9999." 
] }, { @@ -694,7 +723,11 @@ "source": [ "
\n", "\n", - "**Warning:** In many cases filling the data with a specific value is dangerous because you end up modifying the actual data, which might affect the results of your analysis. For example, in the case above we would have dramatically changed the temperature difference columns because the 0 values not an actual temperature difference! Hence, use caution when filling missing values.\n", + "**Warning:** \n", + " \n", + "In many cases filling the data with a specific value is dangerous because you end up modifying the actual data, which might affect the results of your analysis. For example, in the case above we would have dramatically changed the temperature difference columns because the -9999 values are not an actual temperature difference! Hence, use caution when filling missing values. \n", + " \n", + "You might have to fill in no data values, for example, when working with GIS data. Always pay attention to potential no data values when reading in data files and doing further analysis!\n", "\n", "
" ]