This project classifies products using machine learning. The main script is main.py.
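The project requires:
- Python 3.6 or higher
- Libraries: sklearn (scikit-learn), pandas, numpy. The argparse, time, pickle, and os modules are also used, but they ship with Python's standard library and need no installation.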
My validation data was provided by my lecturer and is located at data/validate.txt. I took the labels from this file and used them to crawl data for training, which I scraped from the tiki and vatgia websites. You can try crawling the data yourself by following my repo: [Crawl Data](https://github.com)
The data used for training in this project is located in data/train.txt. Each line in the file represents a product with its category and description.
For testing, the data is located in data/test.txt. Each line in the file represents a product with its category and description.
The data is in the following format:
__label__cate_gory description
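To make the line format concrete, here is a minimal parsing sketch. It assumes the label is the first whitespace-separated token, that it always starts with `__label__`, and that multi-word categories are joined with underscores. The names `parse_line` and `load_dataset` are hypothetical and are not taken from train.py.

```python
from typing import List, Tuple

LABEL_PREFIX = "__label__"


def parse_line(line: str) -> Tuple[str, str]:
    """Split one line into (category, description)."""
    line = line.strip()
    if not line.startswith(LABEL_PREFIX):
        raise ValueError(f"line does not start with {LABEL_PREFIX!r}: {line!r}")
    # Assumption: the label token contains no whitespace, so the first
    # whitespace-separated token is the label and the rest is the description.
    parts = line.split(maxsplit=1)
    label = parts[0][len(LABEL_PREFIX):]
    description = parts[1] if len(parts) > 1 else ""
    return label, description


def load_dataset(path: str) -> Tuple[List[str], List[str]]:
    """Read a file in the format above into parallel label/text lists."""
    labels, texts = [], []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            if not raw.strip():
                continue  # skip blank lines
            label, text = parse_line(raw)
            labels.append(label)
            texts.append(text)
    return labels, texts
```

With this sketch, `load_dataset("data/train.txt")` would return the categories and descriptions as two lists of equal length.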
To run the script, use the following command:
bash train.sh
The project consists of the following scripts:
- train.sh: This shell script sets up the environment and runs the train.py script with the necessary arguments.
- train.py: This script is responsible for loading the data, preprocessing it, splitting it into training and testing sets, training a Naive Bayes model, and evaluating the model's performance.
The project uses a Naive Bayes model for product classification. The model is trained on the product data and then evaluated for its performance. The trained model and the vectorizer are saved as pickle files for future use.
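As a rough illustration of training a Naive Bayes model and pickling it together with its vectorizer, the sketch below uses scikit-learn. The README does not say which Naive Bayes variant or vectorizer train.py uses, so `MultinomialNB` and `TfidfVectorizer` are assumptions here. The toy data and the file names `model.pkl` and `vectorizer.pkl` are also invented for the example.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration only (not taken from data/train.txt).
texts = [
    "wireless bluetooth headphones",
    "noise cancelling headphones",
    "cotton t-shirt for men",
    "slim fit cotton shirt",
]
labels = ["electronics", "electronics", "fashion", "fashion"]

# Assumed choices: TF-IDF features and a multinomial Naive Bayes classifier.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB()
model.fit(X, labels)

with tempfile.TemporaryDirectory() as out_dir:
    model_path = os.path.join(out_dir, "model.pkl")            # hypothetical name
    vectorizer_path = os.path.join(out_dir, "vectorizer.pkl")  # hypothetical name

    # Save both objects: the model is only usable with the vectorizer it was fit on.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(vectorizer_path, "wb") as f:
        pickle.dump(vectorizer, f)

    # Load them back, as a later prediction script might do.
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)
    with open(vectorizer_path, "rb") as f:
        restored_vectorizer = pickle.load(f)

features = restored_vectorizer.transform(["bluetooth headphones"])
prediction = restored_model.predict(features)[0]
```

Saving the vectorizer alongside the model matters because new text must be mapped to the same feature columns that were used during training.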
The workflow of the project is as follows:
- Load the training data from data/train.txt.
- Preprocess the training data (if the --is_preprocess flag is set to True).
- Encode the target labels into numerical form.
- Train a Naive Bayes model on the training data (if the --is_train flag is set to True).
- Evaluate the model's performance on the validation data (if the --is_evaluate flag is set to True).
- Save the trained model and the vectorizer as pickle files.
The project is designed to be flexible, allowing you to control the preprocessing, training, and evaluation stages through command-line arguments.
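The sketch below shows one possible way to declare the three stage flags with argparse. The flag names come from the workflow above. The defaults, the accepted spellings of true and false, and the helper `str2bool` are assumptions and may differ from train.py. A custom converter is used because `type=bool` would turn any non-empty string, including "False", into True.

```python
import argparse


def str2bool(value: str) -> bool:
    """Hypothetical helper: parse 'True'/'False'-style strings into a bool."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes", "y"):
        return True
    if lowered in ("false", "0", "no", "n"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean value, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Product classification pipeline")
    # Flag names are from the README; the default of True is an assumption.
    for flag in ("--is_preprocess", "--is_train", "--is_evaluate"):
        parser.add_argument(flag, type=str2bool, default=True)
    return parser


# Example: keep preprocessing and training on, skip evaluation.
args = build_parser().parse_args(["--is_preprocess", "True", "--is_evaluate", "False"])

if args.is_preprocess:
    pass  # preprocessing stage would run here
if args.is_train:
    pass  # training stage would run here
if args.is_evaluate:
    pass  # evaluation stage would run here
```

Under these assumptions, a call such as `python train.py --is_train True --is_evaluate False` would train the model without evaluating it.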