Practicum 2: Using Sequential Neural Networks and Transfer Learning to Assess Biodiversity at Karoo National Park, South Africa
Matthew Clark
INTRODUCTION
Camera traps, remotely triggered cameras that use motion sensors to take photos when animals walk by, offer an unprecedented, noninvasive way to collect large amounts of data on wildlife. They also present a great opportunity for harnessing big data in wildlife biology (Ahumada et al. 2019). However, a limitation of this type of data is the time needed to identify the animals in the images. Convolutional neural networks (CNNs) address this limitation, as these machine learning algorithms can quickly classify large numbers of images.
However, animals do not pose for pictures, and the angle of the animal in a photo may affect the ability of the algorithm to identify the animal in question. This includes situations in which the animal is seen from the front or from the rear rather than from the side; the same animal can look very different in these situations. In addition, sexual dimorphism can be a factor in identifying an animal in a camera trap image. This project will attempt to determine whether the position of the animal affects the ability of a convolutional neural network to identify a species. In addition, it is possible that some Keras applications have greater success than others in correctly classifying the different categories of animal involved in camera trap images.
The goal of this assignment is to determine if the position of the animal and the type of Keras architecture affect the ability of a CNN to identify an animal. Data will be obtained from three datasets: Snapshot Karoo: Season 1, Snapshot Camdeboo: Season 1 and Snapshot Kgalagadi: Season 1. These datasets are part of the Snapshot Safari program and are available through the Labeled Information Library of Alexandria (LILA BC) (n.d.). This lab will look at three categories of South African herbivore: the bull greater kudu (Tragelaphus strepsiceros), the cow eland (Taurotragus oryx) and the mountain zebra (Equus zebra). Datasets will include a control dataset, in which all data for each category of animal is included in a single file, and an experimental dataset, in which the data for each category is separated into front, rear and side views of the animal. The performance of the model on both the control group and the experimental group will be assessed to determine the effectiveness of the algorithm.
In addition, the performance of the CNN on the control and experimental groups will be tested using a sequential neural network with several different Keras applications, including the ResNet101 and VGG16 architectures. To assess the accuracy of the algorithm in identifying individual categories, a confusion matrix will also be produced to show which categories the algorithm has difficulty identifying.
This lab utilizes code created by Iftekher Mamun (2019). This code was chosen for its ability to quickly iterate through epochs, its higher accuracy and the ease with which it can create and display a confusion matrix. Its use of the VGG16 architecture worked out well for this practicum because VGG16 possesses fewer layers (Mamun 2019), which allowed it to run quickly. Due to the pandemic, only local internet resources could be used, and these were prone to going offline at inopportune times. This model allowed the code to be run and tested in about an hour, which allowed for more experimentation with the code.
DATA PREPARATION
The data was prepared using methods from Flovik (2020), although the data for each category was split in a different manner. The data was divided into a 64%/16%/20% train/validation/test split using methods from Shah (2017). The data needed to be prepared ahead of time into distinct categories for the front, side and rear views of the animal. Initially, additional categories were attempted, but due to constraints in time and bandwidth, a simpler dataset needed to be created. This involved sorting through the datasets from each of the data sources by hand and placing the images into individual folders for each category. The control dataset had just one category for each species.
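A 64%/16%/20% split of this kind can be produced by applying scikit-learn's train_test_split twice. This is a sketch rather than the exact code used; the file names are hypothetical:

```python
from sklearn.model_selection import train_test_split

# Hypothetical file names for one category (e.g. side views of the eland).
images = [f"eland_side_{i:03d}.jpg" for i in range(100)]

# Split off 20% for testing first, then 20% of the remainder for
# validation: 0.8 * 0.8 = 0.64, giving 64%/16%/20% train/val/test.
train_files, test_files = train_test_split(images, test_size=0.20, random_state=42)
train_files, val_files = train_test_split(train_files, test_size=0.20, random_state=42)

print(len(train_files), len(val_files), len(test_files))  # 64 16 20
```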
The final dataset was not balanced because most pictures of the animals were side views rather than front or rear views. Animals photographed at a diagonal are hard to classify, so an effort was made to use pictures in which the position of the animal was clearly distinguishable. The control dataset ended up with 788 total images. The experimental dataset ended up with fewer images, 773, and it is not clear why; unfortunately, this discrepancy was discovered very late in the project and there was not enough time to fix it. In the control dataset, the cow eland had 277 training images, 69 validation images and 86 test images for 432 images total. The bull greater kudu had 257 training images, 64 validation images and 80 test images for 401 images total. The mountain zebra had 254 training images, 64 validation images and 80 testing images for 398 images total.
The experimental dataset included separate categories for the side, front and rear views of each animal. The side view of the cow eland included 182 training images, 46 validation images and 57 test images for a total of 285 images. The front view of the cow eland had 57 training images, 14 validation images and 18 testing images for a total of 89 images. The rear view of the cow eland had 37 training images, 9 validation images and 12 testing images for a total of 58 images. The side view of the bull greater kudu had 182 training images, 46 validation images and 57 test images for 285 total images. The front view had 46 training images, 11 validation images and 14 testing images for 71 total images. The rear view had 26 training images, 8 validation images and 10 testing images for a total of 44 images. The mountain zebra's side view had 165 training images, 41 validation images and 51 testing images for 257 total images. The front view of the mountain zebra had 48 training images, 13 validation images and 15 testing images with a total of 75 images. The mountain zebra's rear view had 28 training images, 7 validation images and 9 testing images for a total of 44 images.
METHODS
Google Colab was used to produce the code for this project, as its cloud infrastructure provided the processing power needed to run the convolutional neural networks involved. The VGG16 and ResNet101 architectures were used to create the sequential neural networks that classify the images. The model created by Iftekher Mamun (2019) resizes the images to 224 by 224 pixels, an image size chosen to make the model compatible with the VGG16 architecture. It then converts the images into a numpy array which is loaded into a bottleneck file for efficient use by the training, validation and testing data generators. The data generators then prepare the data for the convolutional network and use a numpy array to generate a set of labels. This code was originally designed to run with 7 epochs and a batch size of 50. However, the number of epochs was increased to 30 in this experiment, and the batch size reduced to 10, as the datasets used were smaller than the Animals-10 dataset from Kaggle. The primary architecture used for the project was VGG16.
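The bottleneck step can be sketched as follows, assuming the tensorflow.keras API. The practicum used weights='imagenet'; weights=None is used here only so the sketch runs without downloading the pretrained weights, and the random batch stands in for real resized camera trap images:

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

# VGG16 convolutional base only; include_top=False drops the fully
# connected layers. (The practicum used weights='imagenet'.)
vgg = VGG16(include_top=False, weights=None, input_shape=(224, 224, 3))

# Stand-in for a batch of camera trap images already resized to 224x224.
batch = (np.random.rand(4, 224, 224, 3) * 255.0).astype("float32")
features = vgg.predict(preprocess_input(batch))

# The activations ("bottleneck features") are saved so the small
# classifier on top can be trained without re-running the VGG16 base.
np.save("bottleneck_features_train.npy", features)
print(features.shape)  # (4, 7, 7, 512)
```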
Creating the Sequential Neural Network
Following Mamun (2019), this project used a CNN built with the Keras Sequential API, with transfer learning used to incorporate the VGG16 architecture. The include_top parameter was set to False to avoid incorporating VGG16's fully connected top layers into the model, and the weights were set to 'imagenet' as per Mamun (2019). The model flattened the data and included two hidden layers, the first with 100 nodes and the second with 50 nodes, both using a LeakyReLU activation function. The model incorporates two dropout layers, one with a dropout of 0.5 and one with a dropout of 0.3. A softmax activation function was used on the final output layer as this is a classification-based model (Chollet 2018).
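A minimal sketch of this classifier, assuming the tensorflow.keras Sequential API; the exact placement of the dropout layers is an assumption, as is the rmsprop optimizer:

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout, LeakyReLU

num_classes = 3  # eland, bull kudu, mountain zebra in the control dataset

# Classifier trained on the VGG16 bottleneck features (7 x 7 x 512).
model = Sequential([
    Input(shape=(7, 7, 512)),
    Flatten(),
    Dense(100),        # first hidden layer, 100 nodes
    LeakyReLU(),
    Dropout(0.5),
    Dense(50),         # second hidden layer, 50 nodes
    LeakyReLU(),
    Dropout(0.3),
    Dense(num_classes, activation="softmax"),  # classification output
])
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```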
Creating the Confusion Matrix
Once the model is trained, its predictions can be generated with the predict function. The next step in Mamun's model involves converting both the labels and predictions from numpy arrays into dataframes that the confusion_matrix() function can use (Mamun 2019). Mamun's code also allows the results to be normalized if the matrix values come out as floats, which makes the matrix easier to read. Finally, Matplotlib is used to create the graphics for the confusion matrix.
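A sketch of this step with scikit-learn and Matplotlib; the labels and predictions here are made up in place of the model's real output:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions for the three control categories.
classes = ["eland", "kudu_bull", "mountain_zebra"]
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 0])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 0])

cm = confusion_matrix(y_true, y_pred)

# Normalizing each row turns counts into per-class fractions,
# which are easier to read than raw counts.
cm_norm = cm.astype(float) / cm.sum(axis=1, keepdims=True)

fig, ax = plt.subplots()
ax.imshow(cm_norm, cmap="Blues")
ax.set_xticks(range(len(classes)))
ax.set_xticklabels(classes)
ax.set_yticks(range(len(classes)))
ax.set_yticklabels(classes)
ax.set_xlabel("Predicted")
ax.set_ylabel("True")
fig.savefig("confusion_matrix.png")
```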
RESULTS
Control Validation Accuracy and Loss:
Control Summary:
The control model ran for 30 epochs and achieved a validation accuracy of 91.88% and a loss of 0.271. Overall, the training and validation accuracy of the model began to plateau around 19 epochs. The testing accuracy was 93.1% with a loss of 0.3344. The model produced a precision of 0.92 for the eland, 0.90 for the kudu bull and 0.97 for the mountain zebra. Overall, 92% of the images of the eland were classified correctly, with 7% classified as a kudu bull and 1% classified as a zebra. 95% of the images of the kudu bull were classified correctly, with 4% classified as an eland and 1% classified as a mountain zebra. 93% of the images of the mountain zebra were classified correctly, while 5% were classified as elands and 3% were classified as a bull kudu.
Experimental Training and Validation Accuracy
Experimental Summary
The experimental data produced a validation accuracy of 74.23% and a loss of 1.164. This model appeared to plateau in accuracy and loss at around 25 epochs, but this may reflect a local minimum. With the experimental data, the model produced a test accuracy of 78.19% and a loss of 0.8606. For the eland, the model produced a precision of 0.84 for the side view, 0.60 for the rear view and 0.78 for the front view. For the greater kudu bull, the model produced a precision of 0.85 for the side view, 0.82 for the rear view and 0.86 for the front view. For the mountain zebra, the model produced a precision of 0.81 for the side view, 0.75 for the front and 1.00 for the rear.
Images of the Eland
The confusion matrix revealed that the algorithm correctly identified the side view of the animal more frequently than the front and rear views. The model correctly identified approximately 82% of the photos of the side view of the eland, with 9% identified as front views of the eland, 4% as rear views of the eland and 5% as a bull kudu. The model correctly identified 67% of the images of the front view of the eland, with 33% misidentified as the side view. The rear view of the eland fared even worse, with only 58% of the images correctly identified; 25% of these images were misidentified as the front view and 17% as the side view.
Images of the Bull Kudu
The model correctly identified 89% of the photos of the side view of the bull kudu. 5% were misclassified as the side view of an eland, 4% were misclassified as the front view and 2% were misclassified as the side view of a mountain zebra. 64% of the images of the front view of the bull kudu were correctly identified. In contrast, 14% were misclassified as the side view of an eland, 14% were misclassified as the side view of the bull kudu and 7% were misclassified as the rear view of the bull kudu. 60% of the images of the rear view of the bull kudu were identified correctly. 30% of these images were misclassified as the side view of the bull kudu and 10% were classified as the side view of the eland.
Images of the Mountain Zebra
The model correctly identified 90% of the images of the side view of the mountain zebra. 2% were misclassified as the front view of the zebra, 2% were misclassified as the side view of a bull kudu and 6% were misclassified as the side view of an eland. The algorithm particularly struggled with the front and rear views of the mountain zebra, achieving an accuracy of 40% for the front and only 33% for the rear. 33% of the images of the front view of the zebra were misclassified as the side view of the zebra, while 27% were surprisingly misclassified as the side view of the eland. 56% of the rear views of the zebra were misclassified as the side view of the zebra, while 11% were misclassified as the front view.
DISCUSSION AND CONCLUSION
Overall, the algorithm proved to be much more capable of identifying different animals than predicted, and creating separate categories for the different angles of the animal actually decreased the accuracy of the model, with images frequently being misclassified as the wrong angle of the same animal. This was particularly true of the front and rear views, which never achieved an accuracy above 67% (eland front view) and were as low as 40% (mountain zebra front view) and 33% (mountain zebra rear view). When the model misclassified an image as the wrong species, it was usually as the side view of an eland; this included 27% of the images of the front view of a zebra, 14% of the images of the front view of the bull kudu and 10% of the images of the rear view of the bull kudu.
One of the issues with this experiment was that the data was very unbalanced, and this may have impacted the results. Originally, image augmentation was supposed to be used with the model, with a rotation range of 40, a width shift of 0.25, a height shift of 0.25, flipped images and zoomed images. However, this augmentation actually produced a lower accuracy with the control data, leading to the assumption that it would not improve the accuracy of the test. This failed to consider that the data in the control group was highly unbalanced. In hindsight, the purpose of the experiment was to see if the algorithm improves with the different angles, not to generate the most accurate algorithm possible. It is highly probable that the small sample sizes for the front and rear views contributed to their poor performance.
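The augmentation settings described above would look roughly like this with Keras's ImageDataGenerator; the zoom_range value is an assumption, as the text gives no number for it:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation settings from the text: rotation range of 40, width and
# height shifts of 0.25, flipped and zoomed images. zoom_range=0.2 is
# an assumed value.
augmenter = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.25,
    height_shift_range=0.25,
    horizontal_flip=True,
    zoom_range=0.2,
)

# One randomly transformed batch from a stand-in image.
batch = next(augmenter.flow(np.random.rand(1, 224, 224, 3), batch_size=1))
print(batch.shape)  # (1, 224, 224, 3)
```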
The confusion matrix for the ResNet101 model for the control group.
The confusion matrix for the ResNet101 model for the experimental group.
The ResNet101 architecture benefits from additional hidden layers and can potentially perform better on more complex datasets (Mamun 2019). However, this model was not able to produce better results, with the control model producing an accuracy of 0.6382 and a loss of 0.7943 and the experimental model producing an accuracy of 0.325 and a loss of 1.943. The control algorithm classified 51% of the images of the eland correctly, with 7% classified as kudu bulls and 42% classified as mountain zebras. 46% of kudu bulls were classified correctly, with 40% classified as elands and 14% classified as mountain zebras. The only animal with a high success rate was the mountain zebra, with 90% classified correctly; only 1% were classified as a bull kudu and 9% as an eland.
The experimental algorithm struggled with the data. Side views of the eland were classified correctly 54% of the time, with 42% classified as a front view and 4% classified as a side view of a mountain zebra. The front view of the eland was classified correctly 72% of the time, with 28% classified as a side view. Rear views of the eland were not classified correctly at all, with 50% classified as a side view, 42% classified as a front view and 8% classified as the side view of a mountain zebra. Only 25% of the photos of the side view of the bull kudu were classified correctly, with 2% classified as a front view of a kudu and 74% classified as side or front views of the eland. 21% of the front views of the kudu were classified correctly, with 79% classified as an eland. No images of the rear view of the bull kudu were classified correctly, with 70% confused with the front view of an eland and 30% confused with the side view of the eland. Only 49% of the side views of the mountain zebra were classified correctly, with 41% classified as the side view of an eland, 8% as a front view of the eland and 2% as the front view of a bull kudu. The algorithm struggled most with the front and rear views of the zebra, with no images classified correctly: 86% of the images of the front view of the mountain zebra were classified as the side or front view of an eland, while 67% of the rear views were classified as the front or side view of an eland and 33% as a side view of the zebra. However, the algorithm only ran for 30 epochs, and with additional epochs it may have performed better.
BIBLIOGRAPHY
Mamun, Iftekher (April 7th, 2019) A Simple CNN: Multi Image Classifier. [Towards Data Science] Retrieved from: https://towardsdatascience.com/a-simple-cnn-multi-image-classifier-31c463324fa
Chollet, Francois (2018) Deep Learning with Python. Manning Publications, Shelter Island, NY.
Shah, Tarang (December 6th, 2017) About Train, Validation and Test Sets in Machine Learning [Towards Data Science] Retrieved from: https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7
Labeled Information Library of Alexandria (LILA BC) (n.d.) Snapshot Kgalagadi (Season 1). Retrieved from: http://lila.science/datasets/snapshot-kgalagadi
Labeled Information Library of Alexandria (LILA BC) (n.d.) Snapshot Camdeboo (Season 1). Retrieved from: http://lila.science/datasets/snapshot-camdeboo
Labeled Information Library of Alexandria (LILA BC) (n.d.) Snapshot Karoo (Season 1). Retrieved from: http://lila.science/datasets/snapshot-karoo
Ahumada, Jorge A., Fegraus, Eric, Birch, Tanya, Flores, Nicole, Kays, Roland, O’Brien, Timothy G., Palmer, Jonathan, Palmer, Stephanie, Schuttler, Stephanie, Zhao, Jennifer Y., Jetz, Walter, Kinnaird, Margaret, Kulkarni, Sayali, Lyet, Arnaud, Thau, David, Duong, Michelle, Olive, Ruth, and Dancer, Anthony (2019) Wildlife Insights: A Platform to Maximize the Potential of Camera Trap and Other Passive Sensor Wildlife Data for the Planet. Environmental Conservation, pp. 1-6. DOI: 10.1017/S0376892919000298
Flovik, Vegard (February 28th, 2020) Deep Transfer Learning for Images Classification: A step-by-step tutorial from data import to accuracy evaluation [Towards Data Science] Retrieved from: https://towardsdatascience.com/deep-transfer-learning-for-image-classification-f3c7e0ec1a14
Practicum 1: Using QGIS and Support Vector Machines to Differentiate Species of Meadow Jumping Mice (Zapus hudsonius) and Western Jumping Mice (Zapus princeps)
Matthew Clark
Introduction
Colorado is home to two different jumping mouse species, the meadow jumping mouse (Zapus hudsonius) and the western jumping mouse (Zapus princeps). Many of the subspecies of meadow jumping mice, including the Preble's meadow jumping mouse (Zapus hudsonius preblei), are listed as endangered (US Fish and Wildlife Service, 2021). Conservation plans for construction along the Front Range frequently have to include plans to avoid or minimize negative effects on the jumping mice. In contrast, the western jumping mouse is thriving. This project will investigate whether habitat differences between the two species can be classified by a machine learning algorithm, in the form of a support vector machine (SVM) learner. The goal of the project is specifically to see if the habitat characteristics of meadow jumping mouse habitat can be detected by the SVM. For this reason, sightings of meadow jumping mice and western jumping mice will be compared to see if the model can tell the difference between them. The US Fish and Wildlife Service recommends that habitat for the Preble's meadow jumping mouse be within 110 meters of a water body, be it a stream, river, pond or lake (Trainor et al. 2012). However, Trainor et al. (2012) found that some mice can be found as far as 340 meters from a body of water, as the mice also frequent the grasslands in the vicinity of rivers. These distance thresholds informed the buffers used in this project.
Data Sources
Data was acquired from BISON (Biodiversity Information Serving Our Nation) in the form of point shapefiles containing data on the geographic locations of the specimens, the institutions that collected the data, taxonomic information and the dates of collection (BISON, n.d.). All data points before 1990 were excluded, as the areas where this data was found may have been developed by the present day. In addition, the data was clipped to include only data points from Colorado. Institutions that contributed data include the Denver Museum of Nature and Science, NatureServe Network, the Museum of Southwestern Biology, Fort Hays Sternberg Museum of Natural History, University of Alaska Museum of the North, iNaturalist.org, Angelo State Natural History Museum, Charles R. Conner Museum and the University of Colorado Museum of Natural History.
Data for land cover was retrieved from the US Geological Survey's 2011 National Land Cover Database (NLCD 2011) (United States Geological Survey, 2011). This dataset contains 20 different land cover types. This study included open water, open space, developed areas (low, medium and high), barren ground, deciduous forests, coniferous forests, mixed forests, shrubs, grasslands, pasture, agricultural areas, wooded wetlands and emergent herbaceous wetlands. Data for rivers was acquired from the USGS National Geospatial Program's map, NHD 20200615 for Colorado State or Territory Shapefile Model Version 2.2.1 (US Geological Survey, 2020). Elevation data was collected from a dataset created by ColoradoView/UV-B Monitoring and Research (n.d.). This dataset consisted of 28 separate raster files representing a digital elevation model (DEM) for the state of Colorado.
QGIS
QGIS software was used to process the data needed for the project. This is the key way in which this project differed from others: data was not collected via an API, but rather by combining external data sources using GIS software. QGIS is open-source software, which allowed the data to be processed remotely without an expensive ArcGIS subscription. At the beginning of the project, a proof-of-concept model was created to see if it was indeed possible to create the required data in QGIS.
All data needed to be converted to the North America Lambert Conformal Conic projection to ensure that the layers would line up and overlap properly. A dataset of Colorado counties was used as a mask to clip the 2011 National Land Cover Database (NLCD 2011) to just Colorado; in the case of the proof-of-concept model, this was just Douglas County. The river data from the USGS was clipped to include only Colorado counties, as it previously included the entire watersheds in the region around Colorado. This data was composed not just of major rivers, but also the flowlines in each direction (North, South, East, Northeast, Southeast, West, Northwest and Southwest). Calculating the distance from streams required this data to be merged. The merged data was then transformed into a raster using the rasterize function. The finest resolution possible was 10 meters. Distances from rivers were then calculated in 10-meter increments using the proximity function.
The BISON data included sightings of both the western and meadow jumping mice, and these shapefiles needed to be converted to the Lambert Conformal Conic coordinate system and then combined using the merge function. Because the datasets all contained the same columns, this was straightforward. Buffers were calculated at a distance of 340 meters as per the observations of Trainor et al. (2012). The zonal histogram function was then used to count the number of pixels of each land cover type that overlapped with each buffer polygon. These counts came from cells of approximately 28 square meters that would later be multiplied by 28 with Pandas to produce the area in square meters of each habitat type within each buffer. The zonal statistics function was used to find the average distance from a river, in 10-meter pixels, within each buffer; this value would later be multiplied by 10 to estimate the average distance in meters from rivers within the buffer. The zonal statistics function was also used to determine the average elevation in meters within each buffer.
The following Python libraries were used: Pandas, NumPy, Seaborn, IPython, Matplotlib and scikit-learn. The scikit-learn functions used include train_test_split, accuracy_score, svm, metrics, cross_val_score, DecisionTreeClassifier and FactorAnalysis.
Preparing the Data
Data preparation involved exporting the final data's attribute table as a .csv file and then using Pandas to edit the data. This involved creating a list of new names for the columns in the dataset and applying them with the .columns attribute. As the data was prepared in QGIS, it was not necessary to fill in columns with missing data. Lambda functions were then used to further edit the data. Some points fell outside the scope of the NLCD and had a value of NoData in some or all of their buffers; to fix this problem, these cells were multiplied by zero in the lambda functions so that this column would not interfere with data analysis. Each land cover column from the NLCD held counts of approximately 28-square-meter cells, so each column needed to be multiplied by 28. The finest resolution possible for the distance from rivers was 10-meter cells, so the values in this column needed to be multiplied by 10.
fid | bisonID | ITISsciNme | xcoord | ycoord | nlcdb2011_0 | nlcdb2011_11 | nlcdb2011_21 | nlcdb2011_22 | nlcdb2011_23 | ... | nlcdb2011_42 | nlcdb2011_43 | nlcdb2011_52 | nlcdb2011_71 | nlcdb2011_81 | nlcdb2011_82 | nlcdb2011_90 | nlcdb2011_95 | riverdist_decameters_mean | Ele_meters_mean | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1061284029 | Zapus princeps | -1.048927e+06 | 34400.810906 | 0 | 0 | 0 | 0 | 0 | ... | 263 | 0 | 36 | 0 | 0 | 0 | 0 | 0 | 18.462810 | 0.000000 |
1 | 2 | 897076990 | Zapus princeps | -8.852860e+05 | 161407.789815 | 9 | 0 | 0 | 0 | 0 | ... | 12 | 0 | 116 | 0 | 132 | 0 | 87 | 0 | 13.076882 | 1362.505689 |
2 | 3 | 897077009 | Zapus princeps | -8.852860e+05 | 161407.789815 | 9 | 0 | 0 | 0 | 0 | ... | 12 | 0 | 116 | 0 | 132 | 0 | 87 | 0 | 13.076882 | 1362.505689 |
3 | 4 | 897077028 | Zapus princeps | -8.852860e+05 | 161407.789815 | 9 | 0 | 0 | 0 | 0 | ... | 12 | 0 | 116 | 0 | 132 | 0 | 87 | 0 | 13.076882 | 1362.505689 |
4 | 5 | 1837315390 | Zapus princeps | -8.850253e+05 | 161374.025382 | 7 | 0 | 0 | 0 | 0 | ... | 22 | 0 | 187 | 0 | 54 | 0 | 83 | 0 | 20.020843 | 1372.888781 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
397 | 398 | 1145109121 | Zapus hudsonius | -7.269401e+05 | -19090.725997 | 0 | 0 | 36 | 0 | 0 | ... | 0 | 0 | 96 | 247 | 0 | 0 | 58 | 0 | 13.668359 | 1761.247582 |
398 | 399 | 1145109130 | Zapus hudsonius | -7.269401e+05 | -19090.725997 | 0 | 0 | 36 | 0 | 0 | ... | 0 | 0 | 96 | 247 | 0 | 0 | 58 | 0 | 13.668359 | 1761.247582 |
399 | 400 | 1145109134 | Zapus hudsonius | -7.269401e+05 | -19090.725997 | 0 | 0 | 36 | 0 | 0 | ... | 0 | 0 | 96 | 247 | 0 | 0 | 58 | 0 | 13.668359 | 1761.247582 |
400 | 401 | 1145109154 | Zapus hudsonius | -7.269401e+05 | -19090.725997 | 0 | 0 | 36 | 0 | 0 | ... | 0 | 0 | 96 | 247 | 0 | 0 | 58 | 0 | 13.668359 | 1761.247582 |
401 | 402 | 1145109161 | Zapus hudsonius | -7.269401e+05 | -19090.725997 | 0 | 0 | 36 | 0 | 0 | ... | 0 | 0 | 96 | 247 | 0 | 0 | 58 | 0 | 13.668359 | 1761.247582 |
402 rows × 23 columns
fid | bisonID | Species | xcoord | ycoord | NoData | Open_water | Dev_open_space | Dev_low | Dev_medium | ... | Conifer_forest | Mixed_forest | Shrubland | Grassland | Pasture | Agriculture | Wetlands_woody | Wetlands_herb | River_Distance | Elevation | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1061284029 | Zapus princeps | -1.048927e+06 | 34400.810906 | 0 | 0 | 0 | 0 | 0 | ... | 263 | 0 | 36 | 0 | 0 | 0 | 0 | 0 | 18.462810 | 0.000000 |
1 | 2 | 897076990 | Zapus princeps | -8.852860e+05 | 161407.789815 | 9 | 0 | 0 | 0 | 0 | ... | 12 | 0 | 116 | 0 | 132 | 0 | 87 | 0 | 13.076882 | 1362.505689 |
2 | 3 | 897077009 | Zapus princeps | -8.852860e+05 | 161407.789815 | 9 | 0 | 0 | 0 | 0 | ... | 12 | 0 | 116 | 0 | 132 | 0 | 87 | 0 | 13.076882 | 1362.505689 |
3 | 4 | 897077028 | Zapus princeps | -8.852860e+05 | 161407.789815 | 9 | 0 | 0 | 0 | 0 | ... | 12 | 0 | 116 | 0 | 132 | 0 | 87 | 0 | 13.076882 | 1362.505689 |
4 | 5 | 1837315390 | Zapus princeps | -8.850253e+05 | 161374.025382 | 7 | 0 | 0 | 0 | 0 | ... | 22 | 0 | 187 | 0 | 54 | 0 | 83 | 0 | 20.020843 | 1372.888781 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
397 | 398 | 1145109121 | Zapus hudsonius | -7.269401e+05 | -19090.725997 | 0 | 0 | 36 | 0 | 0 | ... | 0 | 0 | 96 | 247 | 0 | 0 | 58 | 0 | 13.668359 | 1761.247582 |
398 | 399 | 1145109130 | Zapus hudsonius | -7.269401e+05 | -19090.725997 | 0 | 0 | 36 | 0 | 0 | ... | 0 | 0 | 96 | 247 | 0 | 0 | 58 | 0 | 13.668359 | 1761.247582 |
399 | 400 | 1145109134 | Zapus hudsonius | -7.269401e+05 | -19090.725997 | 0 | 0 | 36 | 0 | 0 | ... | 0 | 0 | 96 | 247 | 0 | 0 | 58 | 0 | 13.668359 | 1761.247582 |
400 | 401 | 1145109154 | Zapus hudsonius | -7.269401e+05 | -19090.725997 | 0 | 0 | 36 | 0 | 0 | ... | 0 | 0 | 96 | 247 | 0 | 0 | 58 | 0 | 13.668359 | 1761.247582 |
401 | 402 | 1145109161 | Zapus hudsonius | -7.269401e+05 | -19090.725997 | 0 | 0 | 36 | 0 | 0 | ... | 0 | 0 | 96 | 247 | 0 | 0 | 58 | 0 | 13.668359 | 1761.247582 |
402 rows × 23 columns
fid | bisonID | Species | xcoord | ycoord | NoData | Open_water | Dev_open_space | Dev_low | Dev_medium | ... | Conifer_forest | Mixed_forest | Shrubland | Grassland | Pasture | Agriculture | Wetlands_woody | Wetlands_herb | River_Distance | Elevation | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1061284029 | Zapus princeps | -1.048927e+06 | 34400.810906 | 0 | 0 | 0 | 0 | 0 | ... | 7364 | 0 | 1008 | 0 | 0 | 0 | 0 | 0 | 184.628104 | 0.000000 |
1 | 2 | 897076990 | Zapus princeps | -8.852860e+05 | 161407.789815 | 0 | 0 | 0 | 0 | 0 | ... | 336 | 0 | 3248 | 0 | 3696 | 0 | 2436 | 0 | 130.768821 | 1362.505689 |
2 | 3 | 897077009 | Zapus princeps | -8.852860e+05 | 161407.789815 | 0 | 0 | 0 | 0 | 0 | ... | 336 | 0 | 3248 | 0 | 3696 | 0 | 2436 | 0 | 130.768821 | 1362.505689 |
3 | 4 | 897077028 | Zapus princeps | -8.852860e+05 | 161407.789815 | 0 | 0 | 0 | 0 | 0 | ... | 336 | 0 | 3248 | 0 | 3696 | 0 | 2436 | 0 | 130.768821 | 1362.505689 |
4 | 5 | 1837315390 | Zapus princeps | -8.850253e+05 | 161374.025382 | 0 | 0 | 0 | 0 | 0 | ... | 616 | 0 | 5236 | 0 | 1512 | 0 | 2324 | 0 | 200.208428 | 1372.888781 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
397 | 398 | 1145109121 | Zapus hudsonius | -7.269401e+05 | -19090.725997 | 0 | 0 | 1008 | 0 | 0 | ... | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 | 136.683588 | 1761.247582 |
398 | 399 | 1145109130 | Zapus hudsonius | -7.269401e+05 | -19090.725997 | 0 | 0 | 1008 | 0 | 0 | ... | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 | 136.683588 | 1761.247582 |
399 | 400 | 1145109134 | Zapus hudsonius | -7.269401e+05 | -19090.725997 | 0 | 0 | 1008 | 0 | 0 | ... | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 | 136.683588 | 1761.247582 |
400 | 401 | 1145109154 | Zapus hudsonius | -7.269401e+05 | -19090.725997 | 0 | 0 | 1008 | 0 | 0 | ... | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 | 136.683588 | 1761.247582 |
401 | 402 | 1145109161 | Zapus hudsonius | -7.269401e+05 | -19090.725997 | 0 | 0 | 1008 | 0 | 0 | ... | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 | 136.683588 | 1761.247582 |
402 rows × 23 columns
Coding for Data Exploration
Data exploration was done with the Seaborn, Matplotlib and IPython modules. Creating these plots relied on separating the numerical data for land cover types, elevations and river distances using the .iloc[] indexer from Pandas. The pair plot was created with Seaborn's pairplot() function. In contrast, the heat map involved converting the dataframe into an array and then using Seaborn's heatmap() function combined with Pandas' corr() method, using techniques from Anita (2019).
Species | Dev_open_space | Dev_low | Dev_medium | Dev_high | Barren_land | Deci_forest | Conifer_forest | Mixed_forest | Shrubland | Grassland | Pasture | Agriculture | Wetlands_woody | Wetlands_herb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Zapus princeps | 0 | 0 | 0 | 0 | 0 | 3948 | 7364 | 0 | 1008 | 0 | 0 | 0 | 0 | 0 |
1 | Zapus princeps | 0 | 0 | 0 | 0 | 0 | 2156 | 336 | 0 | 3248 | 0 | 3696 | 0 | 2436 | 0 |
2 | Zapus princeps | 0 | 0 | 0 | 0 | 0 | 2156 | 336 | 0 | 3248 | 0 | 3696 | 0 | 2436 | 0 |
3 | Zapus princeps | 0 | 0 | 0 | 0 | 0 | 2156 | 336 | 0 | 3248 | 0 | 3696 | 0 | 2436 | 0 |
4 | Zapus princeps | 0 | 0 | 0 | 0 | 0 | 2324 | 616 | 0 | 5236 | 0 | 1512 | 0 | 2324 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
397 | Zapus hudsonius | 1008 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 |
398 | Zapus hudsonius | 1008 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 |
399 | Zapus hudsonius | 1008 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 |
400 | Zapus hudsonius | 1008 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 |
401 | Zapus hudsonius | 1008 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 |
402 rows × 15 columns
Open_water | Dev_open_space | Dev_low | Dev_medium | Dev_high | Barren_land | Deci_forest | Conifer_forest | Mixed_forest | Shrubland | Grassland | Pasture | Agriculture | Wetlands_woody | Wetlands_herb | River_Distance | Elevation | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 3948 | 7364 | 0 | 1008 | 0 | 0 | 0 | 0 | 0 | 184.628104 | 0.000000 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 2156 | 336 | 0 | 3248 | 0 | 3696 | 0 | 2436 | 0 | 130.768821 | 1362.505689 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 2156 | 336 | 0 | 3248 | 0 | 3696 | 0 | 2436 | 0 | 130.768821 | 1362.505689 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 2156 | 336 | 0 | 3248 | 0 | 3696 | 0 | 2436 | 0 | 130.768821 | 1362.505689 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 2324 | 616 | 0 | 5236 | 0 | 1512 | 0 | 2324 | 0 | 200.208428 | 1372.888781 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
397 | 0 | 1008 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 | 136.683588 | 1761.247582 |
398 | 0 | 1008 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 | 136.683588 | 1761.247582 |
399 | 0 | 1008 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 | 136.683588 | 1761.247582 |
400 | 0 | 1008 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 | 136.683588 | 1761.247582 |
401 | 0 | 1008 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 | 136.683588 | 1761.247582 |
402 rows × 17 columns
Results for Data Exploration
The sheer number of variables made interpreting the pair plot difficult. Many variables had skewed distributions due to the high number of zero values. It is hoped that the large sample sizes will be sufficient to compensate for this effect. Strong relationships between the different development types were detected in both the correlation-matrix heat map and the pair plot. The heat map in particular showed strong relationships between open space development and low development areas, between medium development and high development areas, and between river distance and grasslands. The final correlation is likely due to the fact that many Preble's jumping mouse sightings were found in grasslands within 320 meters of water, as observed in Trainor et al. (2012). A correlation between mixed forest and deciduous forest was also observed. Coniferous forest correlated strongly and positively with elevation, and strongly and negatively with grasslands. This is not surprising as the two habitat types are found at different elevations.
Open_water | Dev_open_space | Dev_low | Dev_medium | Dev_high | Barren_land | Deci_forest | Conifer_forest | Mixed_forest | Shrubland | Grassland | Pasture | Agriculture | Wetlands_woody | Wetlands_herb | River_Distance | Elevation | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Open_water | 1.000000 | -0.036538 | 0.021182 | 0.004826 | -0.006283 | 0.144318 | -0.018428 | -0.020311 | -0.011812 | -0.080906 | -0.045913 | 0.028194 | 0.095883 | 0.135825 | -0.016493 | -0.056878 | 0.042670 |
Dev_open_space | -0.036538 | 1.000000 | 0.569501 | 0.375584 | 0.005372 | 0.092142 | -0.240566 | -0.154506 | -0.166893 | -0.188156 | -0.055796 | 0.069983 | 0.062039 | 0.216518 | 0.440834 | -0.166596 | -0.155284 |
Dev_low | 0.021182 | 0.569501 | 1.000000 | 0.679961 | 0.145106 | -0.019672 | -0.177684 | -0.250567 | -0.100186 | -0.110592 | -0.061569 | 0.009865 | 0.235022 | 0.081827 | 0.390990 | -0.125091 | -0.210519 |
Dev_medium | 0.004826 | 0.375584 | 0.679961 | 1.000000 | 0.644879 | -0.005309 | -0.135071 | -0.198269 | -0.075286 | -0.132820 | -0.105259 | 0.006409 | 0.055239 | 0.044841 | 0.170482 | -0.044876 | -0.150872 |
Dev_high | -0.006283 | 0.005372 | 0.145106 | 0.644879 | 1.000000 | -0.010526 | -0.058957 | -0.086835 | -0.029668 | -0.089826 | -0.079595 | -0.025843 | -0.000749 | -0.041685 | -0.030765 | 0.023966 | -0.072088 |
Barren_land | 0.144318 | 0.092142 | -0.019672 | -0.005309 | -0.010526 | 1.000000 | -0.035945 | 0.050467 | -0.029884 | -0.096226 | -0.048178 | -0.027834 | -0.007122 | -0.009112 | -0.034465 | -0.050859 | 0.204099 |
Deci_forest | -0.018428 | -0.240566 | -0.177684 | -0.135071 | -0.058957 | -0.035945 | 1.000000 | -0.079970 | 0.506314 | -0.080500 | -0.344262 | -0.015278 | -0.036849 | -0.101884 | -0.192160 | -0.211640 | 0.290375 |
Conifer_forest | -0.020311 | -0.154506 | -0.250567 | -0.198269 | -0.086835 | 0.050467 | -0.079970 | 1.000000 | 0.046850 | -0.105764 | -0.546043 | -0.135393 | -0.056068 | -0.067948 | -0.283852 | -0.167159 | 0.515689 |
Mixed_forest | -0.011812 | -0.166893 | -0.100186 | -0.075286 | -0.029668 | -0.029884 | 0.506314 | 0.046850 | 1.000000 | -0.188500 | -0.229213 | 0.010897 | -0.018883 | -0.091582 | -0.096215 | -0.123846 | 0.236488 |
Shrubland | -0.080906 | -0.188156 | -0.110592 | -0.132820 | -0.089826 | -0.096226 | -0.080500 | -0.105764 | -0.188500 | 1.000000 | -0.289694 | -0.073769 | -0.058049 | -0.199858 | -0.132898 | -0.380409 | -0.333746 |
Grassland | -0.045913 | -0.055796 | -0.061569 | -0.105259 | -0.079595 | -0.048178 | -0.344262 | -0.546043 | -0.229213 | -0.289694 | 1.000000 | -0.182204 | -0.030132 | -0.168646 | 0.119018 | 0.680129 | -0.261759 |
Pasture | 0.028194 | 0.069983 | 0.009865 | 0.006409 | -0.025843 | -0.027834 | -0.015278 | -0.135393 | 0.010897 | -0.073769 | -0.182204 | 1.000000 | -0.008852 | 0.304479 | -0.016719 | -0.094673 | -0.129430 |
Agriculture | 0.095883 | 0.062039 | 0.235022 | 0.055239 | -0.000749 | -0.007122 | -0.036849 | -0.056068 | -0.018883 | -0.058049 | -0.030132 | -0.008852 | 1.000000 | 0.172482 | 0.010176 | -0.041043 | -0.076413 |
Wetlands_woody | 0.135825 | 0.216518 | 0.081827 | 0.044841 | -0.041685 | -0.009112 | -0.101884 | -0.067948 | -0.091582 | -0.199858 | -0.168646 | 0.304479 | 0.172482 | 1.000000 | 0.161338 | -0.203844 | -0.079098 |
Wetlands_herb | -0.016493 | 0.440834 | 0.390990 | 0.170482 | -0.030765 | -0.034465 | -0.192160 | -0.283852 | -0.096215 | -0.132898 | 0.119018 | -0.016719 | 0.010176 | 0.161338 | 1.000000 | -0.185898 | -0.181274 |
River_Distance | -0.056878 | -0.166596 | -0.125091 | -0.044876 | 0.023966 | -0.050859 | -0.211640 | -0.167159 | -0.123846 | -0.380409 | 0.680129 | -0.094673 | -0.041043 | -0.203844 | -0.185898 | 1.000000 | -0.186060 |
Elevation | 0.042670 | -0.155284 | -0.210519 | -0.150872 | -0.072088 | 0.204099 | 0.290375 | 0.515689 | 0.236488 | -0.333746 | -0.261759 | -0.129430 | -0.076413 | -0.079098 | -0.181274 | -0.186060 | 1.000000 |
Creating the SVM model and Displaying the Results
The actual SVM model was created using code from the YouTube channel CMS WisCon (April 30, 2020). This source was chosen because this was the first time I created an SVM using Python, and I was having difficulty creating the model from a Pandas dataframe. The first step involved setting a random seed for the model using NumPy's random.seed() function. The dataframe was then converted to an array using the .values attribute. Testing and training datasets were created from the original dataframe. The training data included values for the land area in square meters of different habitat types, the distance to rivers in meters and the elevation of the habitat. Latitude and longitude were left out due to the great difference in the orders of magnitude of the data. The column for species was used as the dependent variable, as it contained the species of the mouse included in the data point. The train_test_split() function from sklearn was used to split both the independent variables and the labels for the dependent variable into training and testing sets. The SVC() class from sklearn was used to construct the support vector machine. Versions of the SVM were also created with linear, polynomial and radial basis function kernels. The models were fit with the .fit() function and tested with the .predict() function. The accuracy of each model was printed using the accuracy_score() function. The results were displayed with a confusion matrix created from code by the website Edpresso (2021) and sklearn's metrics package. These included a confusion matrix (metrics.confusion_matrix) and a classification report with precision, recall, f1-scores and support (metrics.classification_report). Results from the support vector machines were graphed in dataframes created using Python and R Tips (2018).
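The steps above can be sketched as follows. The habitat matrix here is random placeholder data (not the BISON table), and the seed value of 42 is an assumption; only the sequence of function calls mirrors the workflow described:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

np.random.seed(42)  # seed value is an assumption; the original seed was not stated

# Placeholder habitat matrix: rows are sightings, columns stand in for
# land-cover areas (m^2), river distance and elevation.
X = np.random.rand(200, 10) * 5000
y = np.random.choice(["Zapus hudsonius", "Zapus princeps"], size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Default SVC uses the RBF kernel; the other kernels are swapped in by name.
for kernel in ["rbf", "linear", "poly"]:
    model = SVC(kernel=kernel)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(kernel, accuracy_score(y_test, pred))

# Confusion matrix and per-class report for the last model fit.
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
```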
Generic SVM
The first model to be tested was a generic SVM model with default settings. The generic SVM model produced an accuracy value of 0.893. Overall, the model had a precision of 0.89 and a recall of 0.88. The model had a precision of 0.90 for meadow jumping mice (Zapus hudsonius) and 0.89 for western jumping mice (Zapus princeps). There were 8 western jumping mice misclassified as meadow jumping mice (false negatives) and 5 meadow jumping mice misclassified as western jumping mice (false positives). This resulted in a lower recall score for western jumping mice (0.83) in comparison to meadow jumping mice (0.93). The generic SVM model produced a Cohen's kappa score of 0.7712, indicating a good agreement between the predicted values and the true values (Lantz, 2015, pg. 323).
Z.hudsonius | Z.princeps | |
---|---|---|
0 | 69 | 5 |
1 | 8 | 39 |
Metric | macro_avg | |
---|---|---|
0 | Accuracy | 0.893 |
1 | Precision | 0.890 |
2 | Recall | 0.880 |
3 | f1-score | 0.890 |
4 | Kappa | 0.770 |
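The summary metrics above can be checked directly from the confusion matrix. Rebuilding the 121 test labels from its four cells reproduces the reported accuracy and Cohen's kappa:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Rebuild label vectors from the confusion matrix reported above:
# 69 correct Z. hudsonius, 5 predicted as Z. princeps,
# 8 Z. princeps predicted as Z. hudsonius, 39 correct Z. princeps.
y_true = ["hudsonius"] * 74 + ["princeps"] * 47
y_pred = (["hudsonius"] * 69 + ["princeps"] * 5
          + ["hudsonius"] * 8 + ["princeps"] * 39)

print(round(accuracy_score(y_true, y_pred), 3))    # 0.893
print(round(cohen_kappa_score(y_true, y_pred), 4))  # 0.7712
```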
Linear Kernel SVM
The SVM with a linear kernel produced a considerably lower accuracy, with a value of 0.71. This model produced a precision value of 0.73 and a recall value of 0.74. This model misclassified 28 meadow jumping mice as western jumping mice, and 7 western jumping mice as meadow jumping mice. The model had a precision value of 0.87 for meadow jumping mice and a value of 0.59 for western jumping mice. It produced a recall value of 0.62 for meadow jumping mice and a value of 0.85 for western jumping mice. The linear SVM model produced a Cohen's kappa score of 0.4371, indicating only a moderate agreement between the predicted values and the true values (Lantz, 2015, pg. 323).
Z.hudsonius | Z.princeps | |
---|---|---|
0 | 46 | 28 |
1 | 7 | 40 |
Metric | macro_avg | |
---|---|---|
0 | Accuracy | 0.711 |
1 | Precision | 0.730 |
2 | Recall | 0.740 |
3 | f1-score | 0.710 |
4 | Kappa | 0.437 |
Polynomial Kernel SVM
The SVM with a polynomial kernel produced an accuracy of 0.893, a precision value of 0.92 and a recall of 0.87. The model produced a precision of 0.86 for meadow jumping mice and a precision of 0.97 for western jumping mice. Recall values were 0.99 for meadow jumping mice and 0.74 for western jumping mice. This model misclassified 12 western jumping mice as meadow jumping mice and 1 meadow jumping mouse as a western jumping mouse. The polynomial SVM model produced a Cohen's kappa score of 0.7638, indicating a good agreement between the predicted values and the true values (Lantz, 2015, pg. 323).
Z.hudsonius | Z.princeps | |
---|---|---|
0 | 73 | 1 |
1 | 12 | 35 |
Metric | macro_avg | |
---|---|---|
0 | Accuracy | 0.893 |
1 | Precision | 0.920 |
2 | Recall | 0.870 |
3 | f1-score | 0.880 |
4 | kappa | 0.764 |
SVM with a Radial Basis Function
This model produced an accuracy of 0.893, an average precision of 0.89 and an average recall of 0.88. Precision values were 0.90 for meadow jumping mice and 0.89 for western jumping mice. Recall values were 0.93 for meadow jumping mice and 0.83 for western jumping mice. This model misclassified 8 western jumping mice as meadow jumping mice, and it misclassified 5 meadow jumping mice as western jumping mice. The radial-basis SVM model produced a Cohen's kappa score of 0.7712, indicating a good agreement between the predicted values and the true values (Lantz, 2015, pg. 323).
Z.hudsonius | Z.princeps | |
---|---|---|
0 | 69 | 5 |
1 | 8 | 39 |
Metric | macro_avg | |
---|---|---|
0 | Accuracy | 0.893 |
1 | Precision | 0.890 |
2 | Recall | 0.880 |
3 | f1-score | 0.890 |
4 | kappa | 0.770 |
Removing variables with high correlation from the Model
This version of the model removed several variables with a high amount of correlation: low development, medium development, mixed forest, elevation and river distance. Low development had a high degree of correlation with medium development. Elevation was associated with different habitats, with low elevations being associated with grasslands and high elevations being associated with conifer forests; as a result, elevation was not necessary. Surprisingly, grasslands were strongly correlated with river distance, as many meadow jumping mouse sightings were located near rivers.
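Dropping a named set of columns is a one-line call in Pandas. The toy dataframe below only illustrates the call; the column values are placeholders, not the real data:

```python
import pandas as pd

# Illustrative dataframe with a subset of the columns named in the text.
df = pd.DataFrame({
    "Species": ["Zapus princeps", "Zapus hudsonius"],
    "Dev_low": [0, 0], "Dev_medium": [0, 0],
    "Mixed_forest": [0, 0], "Elevation": [1362.5, 1761.2],
    "River_Distance": [130.8, 136.7],
    "Conifer_forest": [336, 0], "Grassland": [0, 6916],
})

# Drop the five highly correlated predictors flagged by the heat map.
reduced = df.drop(columns=["Dev_low", "Dev_medium", "Mixed_forest",
                           "Elevation", "River_Distance"])
print(list(reduced.columns))
```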
Generic SVM with Fewer Variables
This model used the generic SVM but dropped the river distance, elevation, low development, medium development and mixed forest variables in an attempt to make a less complicated model. This model ended up producing identical results to the generic model with all attributes intact. Its accuracy was 0.893, its precision was 0.89 and its recall was 0.88. This model produced a Cohen's kappa score of 0.7712, indicating a good agreement between the predicted values and the true values (Lantz, 2015, pg. 323).
Species | Dev_open_space | Dev_high | Barren_land | Deci_forest | Conifer_forest | Shrubland | Grassland | Pasture | Agriculture | Wetlands_woody | Wetlands_herb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Zapus princeps | 0 | 0 | 0 | 3948 | 7364 | 1008 | 0 | 0 | 0 | 0 | 0 |
1 | Zapus princeps | 0 | 0 | 0 | 2156 | 336 | 3248 | 0 | 3696 | 0 | 2436 | 0 |
2 | Zapus princeps | 0 | 0 | 0 | 2156 | 336 | 3248 | 0 | 3696 | 0 | 2436 | 0 |
3 | Zapus princeps | 0 | 0 | 0 | 2156 | 336 | 3248 | 0 | 3696 | 0 | 2436 | 0 |
4 | Zapus princeps | 0 | 0 | 0 | 2324 | 616 | 5236 | 0 | 1512 | 0 | 2324 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
397 | Zapus hudsonius | 1008 | 0 | 0 | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 |
398 | Zapus hudsonius | 1008 | 0 | 0 | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 |
399 | Zapus hudsonius | 1008 | 0 | 0 | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 |
400 | Zapus hudsonius | 1008 | 0 | 0 | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 |
401 | Zapus hudsonius | 1008 | 0 | 0 | 0 | 0 | 2688 | 6916 | 0 | 0 | 1624 | 0 |
402 rows × 12 columns
Z.hudsonius | Z.princeps | |
---|---|---|
0 | 69 | 5 |
1 | 8 | 39 |
Metric | macro_avg | |
---|---|---|
0 | Accuracy | 0.893 |
1 | Precision | 0.890 |
2 | Recall | 0.880 |
3 | f1-score | 0.890 |
4 | kappa | 0.771 |
Polynomial SVM with Fewer Variables
This model was similar to the generic SVM with fewer variables, but it used a polynomial kernel. This model misclassified 7 meadow jumping mice as western jumping mice, and it misclassified 9 western jumping mice as meadow jumping mice. The model's total accuracy was 0.868, with a precision value of 0.86 and a recall value of 0.86. The model produced a precision of 0.88 for the meadow jumping mouse and 0.84 for the western jumping mouse. It produced a recall value of 0.91 for the meadow jumping mouse and a recall of 0.81 for the western jumping mouse. The SVM model produced a Cohen's kappa score of 0.720, indicating a good agreement between the predicted values and the true values (Lantz, 2015, pg. 323).
Z.hudsonius | Z.princeps | |
---|---|---|
0 | 67 | 7 |
1 | 9 | 38 |
Metric | macro_avg | |
---|---|---|
0 | Accuracy | 0.868 |
1 | Precision | 0.860 |
2 | Recall | 0.860 |
3 | f1-score | 0.860 |
4 | kappa | 0.720 |
Decision Tree Models
In addition to the SVM models, a decision tree was also created, as these algorithms tend to be good at dealing with unbalanced data (Boyle 2019). The actual decision tree model was created using code from Raschka and Mirjalili (2019), pg. 96. The graphic for the decision tree was created using code from Ptonski (2020).
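A sketch of the decision tree setup described below, using random placeholder data and assumed feature names (the real model was fit on the habitat columns shown earlier):

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree

np.random.seed(0)
# Placeholder habitat matrix; columns stand in for land-cover areas.
X = np.random.rand(100, 5) * 5000
y = np.random.choice(["Zapus hudsonius", "Zapus princeps"], size=100)

# Gini criterion and a max depth of 4, as described in the text.
tree = DecisionTreeClassifier(criterion="gini", max_depth=4)
tree.fit(X, y)

# Draw the fitted tree (Ptonski 2020 shows several variants of this call).
plt.figure(figsize=(14, 8))
plot_tree(tree,
          feature_names=["Conifer_forest", "Shrubland", "Barren_land",
                         "Grassland", "Deci_forest"],
          class_names=tree.classes_, filled=True)
```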
Decision Tree
The final model was a decision tree model with a Gini index and a max depth of 4. This model produced an accuracy of 0.901, with a precision of 0.93 and a recall of 0.90. For meadow jumping mice the model produced a precision of 0.89, a recall of 0.99 and misclassified 1 meadow jumping mouse as a western jumping mouse. For western jumping mice, the model produced a precision of 0.97 and a recall value of 0.81. It misclassified 9 western jumping mice as meadow jumping mice. The decision tree model produced a Cohen's kappa score of 0.820, indicating a good agreement between the predicted values and the true values (Lantz, 2015, pg. 323). Coniferous forests proved to be an important factor in classifying the mice. This makes sense as the western jumping mouse is more likely to be found in coniferous forests. Surprisingly, shrublands and barren land were also important factors in classifying the mice.
Z.hudsonius | Z.princeps | |
---|---|---|
0 | 73 | 1 |
1 | 9 | 38 |
Metric | macro_avg | |
---|---|---|
0 | Accuracy | 0.901 |
1 | Precision | 0.930 |
2 | Recall | 0.900 |
3 | f1-score | 0.910 |
4 | kappa | 0.820 |
Model | True Positives | False Positives | False Negatives | True Negatives | Accuracy | Precision | Recall | F-1-Score | Kappa | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Generic | 69 | 8 | 5 | 39 | 0.893 | 0.89 | 0.88 | 0.89 | 0.772 |
1 | Linear | 46 | 7 | 28 | 40 | 0.710 | 0.73 | 0.74 | 0.71 | 0.437 |
2 | Polynomial | 73 | 12 | 1 | 35 | 0.893 | 0.92 | 0.87 | 0.88 | 0.764 |
3 | Radial-Basis Function | 69 | 8 | 5 | 39 | 0.893 | 0.89 | 0.88 | 0.89 | 0.771 |
4 | Reduced | 69 | 8 | 5 | 39 | 0.893 | 0.89 | 0.88 | 0.89 | 0.771 |
5 | Reduced Polynomial | 67 | 9 | 7 | 38 | 0.868 | 0.86 | 0.86 | 0.86 | 0.720 |
6 | Decision Tree | 73 | 9 | 1 | 38 | 0.901 | 0.93 | 0.90 | 0.91 | 0.820 |
Model | Precision_hudsonius | Precision_princeps | Recall_hudsonius | Recall_princeps | F-1-Score_hudsonius | F-1-Score_princeps | |
---|---|---|---|---|---|---|---|
0 | Generic SVM | 0.90 | 0.89 | 0.93 | 0.83 | 0.91 | 0.86 |
1 | Linear SVM | 0.87 | 0.59 | 0.62 | 0.85 | 0.72 | 0.70 |
2 | Polynomial SVM | 0.86 | 0.97 | 0.99 | 0.74 | 0.92 | 0.84 |
3 | Radial-Basis Function SVM | 0.90 | 0.89 | 0.93 | 0.83 | 0.91 | 0.86 |
4 | Reduced Generic SVM | 0.90 | 0.89 | 0.93 | 0.83 | 0.91 | 0.86 |
5 | Reduced Polynomial SVM | 0.88 | 0.84 | 0.91 | 0.81 | 0.89 | 0.83 |
6 | Decision Tree | 0.89 | 0.97 | 0.99 | 0.81 | 0.94 | 0.88 |
Comparing Models
The model with the highest F1-score was the decision tree model with 0.91. The generic SVM, reduced SVM and radial-basis function models all produced values of 0.89. The polynomial model produced a value of 0.88. The model with the lowest F1-score was the linear model with 0.71. The next lowest was the polynomial SVM with a reduced number of variables, with a value of 0.86.
Discussion
Based on F1-scores, the best model was the decision tree model, but the generic model, the generic model with reduced variables and the radial-basis function model were not far behind with 0.89. Overall there wasn't a large difference between models except for the linear model. The linear model performed the worst with 0.71, which is not surprising considering the data is binomial. When calculating Cohen's kappa values, the linear model performed the worst, as expected, with only a moderate level of agreement (k = 0.4371) (Lantz, 2015, pg. 323). The decision tree model performed the best (k = 0.820), indicating a very good agreement between predicted and actual values. The other models ranged from k-values of 0.720 (reduced variable polynomial SVM) to 0.7712 (generic SVM, reduced variable SVM and the radial-basis kernel SVM). These models had a good agreement between the predicted and actual values.
Removing redundant variables had little effect on the overall effectiveness of the model. In fact, it actually produced a slightly poorer performance in the polynomial model (F1 = 0.868). False negatives are more serious in this model as Z. hudsonius has several threatened subspecies in Colorado. So classifying a Z. hudsonius as a Z. princeps (a false negative) is more serious than classifying a Z. princeps as a Z. hudsonius (a false positive). The models with the fewest misclassified meadow jumping mice were the polynomial SVM and the decision tree models, with a single misclassified meadow jumping mouse each.
The models in this project had two major issues. The first was that the data was unbalanced, which created problems in classification. Overall, there were considerably fewer data points for western jumping mice, which likely contributed to the higher number of western jumping mice being misclassified in most models. The other major issue was the small size of the dataset. In most wildlife biology studies, datasets greater than 30 samples are often considered to be large enough. However, in machine learning this can create issues as most models rely on larger amounts of data. In hindsight, testing the model on species with more data points may have produced better results.
Several variables did have redundancies. Elevation was strongly correlated with both grasslands and coniferous forests. This makes sense because coniferous forests are found at higher elevations than grasslands. There was also a strong correlation between river distances and grasslands. This makes sense because the meadow jumping mouse is generally found in the wetlands and grasslands within 340 meters of water (Trainor et al. 2012).
Using K-fold Cross Validation
K-fold cross-validation is frequently used as a technique for dealing with unbalanced data (Boyle 2019). The cross-validation code was developed from the scikit-learn developers' documentation (2020) and was used on the two reduced models. However, these models were not able to produce higher accuracy and F1-values than their counterparts. The most promising of these models was the cross-validated polynomial SVM, which produced a mean accuracy of 0.878 and a mean F1-value of 0.870. Surprisingly, the decision tree model with cross-validation did not outperform the polynomial SVM; it had an accuracy of 0.876 and an F1-score of 0.865.
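A sketch of the cross-validation step, following the scikit-learn documentation cited above; the data is a random placeholder and the fold count of 5 is an assumption (the original fold count was not stated):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

np.random.seed(1)
# Placeholder habitat matrix and labels.
X = np.random.rand(120, 8) * 5000
y = np.random.choice(["Zapus hudsonius", "Zapus princeps"], size=120)

# 5-fold cross-validation, scoring both accuracy and macro-averaged F1.
model = SVC(kernel="poly")
acc = cross_val_score(model, X, y, cv=5, scoring="accuracy")
f1 = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
print(acc.mean(), f1.mean())
```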
Model | mean accuracy | F-1 Score | |
---|---|---|---|
0 | K-fold SVM | 0.851 | 0.843 |
1 | K-fold Polynomial SVM | 0.878 | 0.870 |
2 | K-fold Decision Tree | 0.876 | 0.865 |
Possible Concerns
One of the concerns with this model is that it might be good at identifying meadow jumping mice simply because there are so many samples of them in the data. Nearly all of the models had issues with identifying western jumping mice, with 7-9 of them being misclassified as meadow jumping mice. In the future, this technique should be tested on species with more data points to see if the misclassification issues are real or simply a product of small sample sizes. In addition, techniques such as under-sampling the larger class could be used to try to make the data more balanced between classes.
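Under-sampling the larger class can be sketched with Pandas' groupby().sample(); the toy dataframe and class counts below are illustrative, not the real class balance:

```python
import pandas as pd

# Toy unbalanced sample: more hudsonius rows than princeps rows.
df = pd.DataFrame({
    "Species": ["Zapus hudsonius"] * 8 + ["Zapus princeps"] * 3,
    "Grassland": range(11),
})

# Under-sample every class down to the size of the smallest one.
n_min = df["Species"].value_counts().min()
balanced = df.groupby("Species").sample(n=n_min, random_state=0)
print(balanced["Species"].value_counts())
```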
Bibliography
US Fish and Wildlife Service (January 6th, 2021) Preble’s Meadow Jumping Mouse Retrieved from: https://www.fws.gov/mountain-prairie/es/preblesMeadowJumpingMouse.php
Trainor, Anne M., Shenk, Tanya M. and Wilson, Kenneth R. (2012) Spatial, temporal, and biological factors associated with Preble's meadow jumping mouse (Zapus hudsonius preblei) home range. Journal of Mammalogy. 93(2), pgs. 429-438. DOI: 10.1644/11-MAMM-A-049.1
Python and R Tips (January 10, 2018) How to Create Pandas Dataframe from Multiple Lists? Pandas Tutorial. [Blog] Retrieved from: https://cmdlinetips.com/2018/01/how-to-create-pandas-dataframe-from-multiple-lists/
CMS WisCon (April 30, 2020) SVM Classifier in Python on Real Data Set [YouTube] Retrieved from: https://www.youtube.com/watch?v=Vv5U0kjYebM
Edpresso (2021) How to create a confusion matrix in Python using scikit-learn [Blog] Retrieved from: https://www.educative.io/edpresso/how-to-create-a-confusion-matrix-in-python-using-scikit-learn
Anita, Okoh (August 20th, 2019) Seaborn Heatmaps: 13 Ways to Customize Correlation Matrix Visualizations [Heartbeat] Retrieved from: https://heartbeat.fritz.ai/seaborn-heatmaps-13-ways-to-customize-correlation-matrix-visualizations-f1c49c816f07
scikit-learn developers (2007-2020) 3.1. Cross-validation: evaluating estimator performance. Retrieved from: https://scikit-learn.org/stable/modules/cross_validation.html
Boyle, Tara (February 3rd, 2019) Dealing with Imbalanced Data: A guide to effectively handling imbalanced datasets in Python. [Towards Data Science] Retrieved from: https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18
Raschka, Sebastian and Vahid Mirjalili (2019) Python Machine Learning Third Edition. Packt Publishing Ltd.
Chen, Daniel Y. (2018) Pandas for Everyone: Python Data Analysis. Pearson Education, Inc.
Lantz, Brett (2015) Machine Learning with R: Second Edition. Packt Publishing Ltd.
Ptonski, Piotr (June 22, 2020) Visualize a Decision Tree in 4 Ways with Scikit-Learn and Python [mljar] Retrieved from: https://mljar.com/blog/visualize-decision-tree/
DMNS Mammal Collection (Arctos). Denver Museum of Nature and Science. Accessed through Biodiversity Information Serving Our Nation (BISON) (n.d.) Zapus hudsonius prebli [Data File] Retrieved from https://bison.usgs.gov/#home on 1/16/2021
NatureServe Network Species Occurrence Data. Accessed through Biodiversity Information Serving Our Nation (BISON) (n.d.) Zapus hudsonius prebli [Data File] Retrieved from https://bison.usgs.gov/#home on 1/16/2021
Museum of Southwestern Biology. Accessed through Biodiversity Information Serving Our Nation (BISON) (n.d.) Zapus hudsonius prebli [Data File] Retrieved from https://bison.usgs.gov/#home on 1/16/2021
Fort Hayes Sternberg Museum of Natural History. Accessed through Biodiversity Information Serving Our Nation (BISON) (n.d.) Zapus hudsonius prebli [Data File] Retrieved from https://bison.usgs.gov/#home on 1/16/2021
University of Alaska Museum of the North. Accessed through Biodiversity Information Serving Our Nation (BISON) (n.d.) Zapus hudsonius prebli [Data File] Retrieved from https://bison.usgs.gov/#home on 1/16/2021
iNaturalist.org. Accessed through Biodiversity Information Serving Our Nation (BISON) (n.d.) Zapus hudsonius prebli [Data File] Retrieved from https://bison.usgs.gov/#home on 1/16/2021
Angelo State Natural History Museum (ASNHC). Accessed through Biodiversity Information Serving Our Nation (BISON) (n.d.) Zapus hudsonius prebli [Data File] Retrieved from https://bison.usgs.gov/#home on 1/16/2021
Charles R. Conner Museum. Accessed through Biodiversity Information Serving Our Nation (BISON) (n.d.) Zapus hudsonius prebli [Data File] Retrieved from https://bison.usgs.gov/#home on 1/16/2021
University of Colorado Museum of Natural History. Accessed through Biodiversity Information Serving Our Nation (BISON) (n.d.) Zapus hudsonius prebli [Data File] Retrieved from https://bison.usgs.gov/#home on 1/16/2021
US Geological Survey. US Department of the Interior. (2014). National Land Cover Database 2011 (NLCD2011). [Raster]. Multi-Resolution Land Characteristics Consortium (MRLC). Retrieved from https://www.mrlc.gov/nlcd11_data.php on April 4th 2017.
CDPHE_user_commuity, Colorado Department of Public Health and the Environment (2/19/2018) Colorado County Boundaries [Data File] Retrieved from: https://data-cdphe.opendata.arcgis.com/datasets/colorado-county-boundaries on 1/16/2021
U.S. Geological Survey, National Geospatial Program (06/15/2020) NHD 20200615 for Colorado State or Territory Shapefile Model Version 2.2.1 [Data File] Retrieved from: https://viewer.nationalmap.gov/basic/?basemap=b1&category=nhd&title=NHD%20View#/ on 1/26/2021
ColoradoView/UV-B Monitoring and Research (n.d.) Colorado Digital Elevation Model files - 1 degree: Section 1-1. [Data File] Retrieved from: https://www.coloradoview.org/aerial-imagery/ on 2/06/2021