The Shopping Classifier is a Python script that uses the k-nearest neighbors algorithm from the scikit-learn library to classify shopping data. It loads data from a CSV file, preprocesses it, splits it into training and testing sets, trains a KNN classifier, and evaluates its performance.
- Python 3.6 or above
- Scikit-learn library
- Clone the repository or download the script file (
shopping.py
). - Install the required dependencies by running the following command:
pip install -U scikit-learn
- Prepare your data in a CSV file, making sure it follows the required format. The file should contain both the features (evidence) and the corresponding labels.
Note: The script expects the last column in the CSV file to be the label column.
- Run the script using the following command:
python shopping.py <data_file.csv>
Replace <data_file.csv>
with the path to your CSV file.
The script performs the following steps:
- Checks the command-line arguments to ensure the correct usage.
- Loads the data from the provided CSV file and preprocesses it for machine learning.
- Splits the data into training and testing sets using a 60:40 ratio.
- Trains a KNN classifier with one neighbor using the training set.
- Makes predictions on the testing set.
- Evaluates the performance of the classifier by calculating sensitivity (true positive rate) and specificity (true negative rate).
- Prints the results, including the number of correct and incorrect predictions, true positive rate, and true negative rate.
If you want to modify the behavior of the script, you can make the following changes:
- Change the
n_neighbors
parameter in theKNeighborsClassifier
constructor to adjust the number of neighbors used for classification. - Modify the
MONTHS
,INT_LABELS
, andFLOAT_LABELS
dictionaries in theload_data
function to customize the mapping of categorical and numerical labels in the CSV file.
This script is provided under the MIT License.
If you encounter any issues or have suggestions for improvements, please create an issue on the repository.
- The script uses the Scikit-learn library for the K-nearest neighbors classifier.
- Data set provided by Sakar, C.O., Polat, S.O., Katircioglu, M. et al. Neural Comput & Applic (2018)
When users are shopping online, not all will end up purchasing something. Most visitors to an online shopping website, in fact, likely don’t end up going through with a purchase during that web browsing session. It might be useful, though, for a shopping website to be able to predict whether a user intends to make a purchase or not: perhaps displaying different content to the user, like showing the user a discount offer if the website believes the user isn’t planning to complete the purchase. How could a website determine a user’s purchasing intent? That’s where machine learning will come in.
I built a nearest-neighbor classifier to solve this problem. Given information about a user — how many pages they’ve visited, whether they’re shopping on a weekend, what web browser they’re using, etc. — my classifier predicted whether or not the user will make a purchase. My classifier wasn’t be perfectly accurate — perfectly modeling human behavior is a task well beyond the scope of this project — but it was better than guessing randomly. To train my classifier, I downloaded some data from a shopping website from about 12,000 users sessions.
How do I measure the accuracy of a system like this? If I have a testing data set, I could run the classifier on the data, and compute what proportion of the time the classifier correctly classify the user’s intent. This would give me a single accuracy percentage. But that number might be a little misleading. Imagine, for example, if about 15% of all users end up going through with a purchase. A classifier that always predicted that the user would not go through with a purchase, then, it would measure as being 85% accurate: the only users it classifies incorrectly are the 15% of users who do go through with a purchase. And while 85% accuracy sounds pretty good, that doesn’t seem like a very useful classifier.
Instead, I measured two values: sensitivity (also known as the “true positive rate”) and specificity (also known as the “true negative rate”). Sensitivity refers to the proportion of positive examples that were correctly identified: in other words, the proportion of users who did go through with a purchase who were correctly identified. Specificity refers to the proportion of negative examples that were correctly identified: in this case, the proportion of users who did not go through with a purchase who were correctly identified. So our “always guess no” classifier from before would have perfect specificity (1.0) but no sensitivity (0.0). My goal was to build a classifier that performs reasonably on both metrics.