Welcome to my data science portfolio. It showcases the data science and data analysis projects that I've done. Click on the links below to check out some of them.
To find out more about me, check out the pptx or the Google Slides deck.
Four projects are described here. To find out more, click on the hyperlinked titles for further detail and to see the code.
Facial recognition challenge for FruitPunch AI: Many sea turtle species are critically endangered, and monitoring sea turtle populations is vital. However, tracking a turtle over several captures is difficult because metal tags can get damaged and can cause distress to the turtle. Using facial recognition, we came up with a solution that is more accurate and faster than manual annotation, and that also minimises harm to the turtle.
Images were first passed through a YOLOv8_SAM network to isolate the relevant turtle pixels. Following this, a variety of models were tested to see which was the most effective. The winning solution was SIFT keypoint extraction followed by LightGlue for keypoint matching. However, instead of the basic point matching demonstrated in the image above, I devised a novel metric that compares the distribution of all the keypoints to a null distribution (the average distribution of non-matching sea turtles). This difference in distributions was quantified using the Wasserstein distance. My team's solution proved to be more effective than other methods such as metric learning and LoFTR.
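As a rough illustration of the distribution comparison described above, here is a minimal sketch. It assumes that each image pair yields a set of per-keypoint match confidences and that the null distribution is pooled from pairs known to show different turtles. The synthetic beta-distributed scores and the helper name `match_score` are hypothetical and are not taken from the project code.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Hypothetical per-keypoint match confidences in [0, 1] (synthetic data).
# null_scores: pooled from image pairs assumed to show different turtles.
null_scores = rng.beta(2, 5, size=5000)

# Two query pairs: one that looks like the null, one skewed to high confidence.
different_pair = rng.beta(2, 5, size=300)
same_pair = rng.beta(5, 2, size=300)


def match_score(pair_scores, null_scores):
    """Wasserstein distance between a pair's scores and the null distribution."""
    return wasserstein_distance(pair_scores, null_scores)


score_same = match_score(same_pair, null_scores)
score_diff = match_score(different_pair, null_scores)
print(f"same turtle: {score_same:.3f}, different turtle: {score_diff:.3f}")
```

Under these assumptions, a larger distance from the null suggests that the two images are more likely to show the same turtle, so candidate matches could be ranked by this score.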

dCrawler: a new clustering algorithm that utilises a single distance threshold. It is ideal when you don't know how many clusters there should be but all the points within a cluster should be closely related.
dCrawlerDemo.mov
Compared to DBSCAN, dCrawler is particularly useful for clustering colors.
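To give a feel for clustering with a single distance threshold, the sketch below uses a simple greedy "leader-style" scheme. It is not the dCrawler algorithm itself, whose details live in the linked project. The function name `threshold_cluster` and the example colors are made up for illustration.

```python
import numpy as np


def threshold_cluster(points, threshold):
    """Greedy single-threshold clustering (leader-style); NOT dCrawler itself.

    Each point joins the nearest existing centroid if it lies within
    `threshold`, otherwise it starts a new cluster.
    """
    centroids = []  # running mean of each cluster
    counts = []     # number of points in each cluster
    labels = np.empty(len(points), dtype=int)
    for i, p in enumerate(points):
        if centroids:
            dists = np.linalg.norm(np.asarray(centroids) - p, axis=1)
            nearest = int(np.argmin(dists))
            if dists[nearest] <= threshold:
                counts[nearest] += 1
                # Incremental update of the cluster mean.
                centroids[nearest] += (p - centroids[nearest]) / counts[nearest]
                labels[i] = nearest
                continue
        centroids.append(np.array(p, dtype=float))
        counts.append(1)
        labels[i] = len(centroids) - 1
    return labels, np.asarray(centroids)


# Illustrative RGB colors: three reds, two blues, one green.
colors = np.array([
    [250, 10, 10], [240, 20, 15], [245, 5, 20],
    [10, 10, 250], [20, 15, 240],
    [10, 250, 10],
], dtype=float)

labels, centroids = threshold_cluster(colors, threshold=40.0)
print(labels)
```

The number of clusters is never specified here. It falls out of the threshold, which is the property that makes this family of methods convenient when the cluster count is unknown.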
As part of the ML Zoomcamp training, this was an exercise in deploying a solution using Docker images. In this example, I used features extracted from histological samples containing malignant or benign tumors. The original dataset is nicely curated but, with approximately 30 variables, is quite large. Using principal component analysis (PCA), I engineered 10 features that explain 95% of the variance. Many models were assessed using a grid search, with the F1 score as the scoring metric because of a class imbalance in the dataset. The most effective model on the validation set was a logistic regression classifier on the 10 principal components, which produced an F1 score >0.975 on the validation data (see figure below).
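The modelling steps above could be sketched roughly as follows. This is a sketch under stated assumptions, not the project's actual code. It uses scikit-learn's built-in breast cancer dataset as a stand-in because it also has 30 features, and the split, the parameter grid, and the variable names are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption): 30 numeric features, binary target.
# Note: in this dataset label 1 means benign, so F1 treats benign as positive.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    # A float in (0, 1) keeps enough components to explain that share of variance.
    ("pca", PCA(n_components=0.95)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Small illustrative grid over the regularisation strength, scored with F1.
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)

n_components = search.best_estimator_.named_steps["pca"].n_components_
val_f1 = f1_score(y_val, search.predict(X_val))
print(f"components kept: {n_components}, validation F1: {val_f1:.3f}")
```

Keeping the scaler and PCA inside the pipeline means they are refitted on each cross-validation fold, which avoids leaking validation data into the feature engineering step.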
In biological imaging, the colours often smear (chromatic aberration), which hinders any further analysis. I noticed that we could model the aberration and so reverse its effects to produce an accurate image. This meant we could keep unaffected areas the same (e.g. E) while correcting the distorted areas (e.g. F).
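The idea of modelling an aberration and then reversing it can be illustrated with a deliberately simplified sketch. It assumes the aberration is a single whole-pixel translation of one colour channel relative to another, with wrap-around at the borders. Real chromatic aberration usually varies across the field of view, and the model used in the project is likely more involved. The function name `estimate_shift` and the synthetic channels are hypothetical.

```python
import numpy as np


def estimate_shift(reference, moving):
    """Integer (dy, dx) so that np.roll(moving, (dy, dx), axis=(0, 1)) matches reference.

    Uses the peak of the circular cross-correlation computed via the FFT.
    """
    cross_power = np.fft.fft2(reference) * np.conj(np.fft.fft2(moving))
    correlation = np.fft.ifft2(cross_power).real
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Peaks past the midpoint correspond to negative shifts (FFT wrap-around).
    return tuple(
        int(p) if p <= s // 2 else int(p - s)
        for p, s in zip(peak, correlation.shape)
    )


rng = np.random.default_rng(0)
green = rng.random((64, 64))                       # reference channel (synthetic)
red = np.roll(green, shift=(3, -2), axis=(0, 1))   # simulated aberration

shift = estimate_shift(green, red)
red_corrected = np.roll(red, shift=shift, axis=(0, 1))
print("estimated correction:", shift)
```

To leave unaffected regions untouched while correcting distorted ones, the same estimate could in principle be computed per image tile rather than once for the whole image, although that extension is not shown here.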