This repository contains a machine learning project for detecting hate speech in Portuguese text using various transformer-based models.
Repository structure:

- `dataset/`: Contains the dataset files
- `model/`: Contains the model implementations
- `utils/`: Contains utility functions
To set up the project (Python 3.11 is required):

- Clone this repository:
  `git clone https://github.com/eliasqueirogavieira/NLP_HateSpeechBR.git`
  `cd NLP_HateSpeechBR`
- Install the required packages:
  `pip install -r requirements.txt`
  `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124`
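After installing, you can confirm that the CUDA-enabled PyTorch build is active before training. This quick check is not part of the repository; it is just a sanity test:

```python
import torch

# Should print a +cu124 build of PyTorch and True if the GPU is visible.
print(torch.__version__)
print(torch.cuda.is_available())

if torch.cuda.is_available():
    # Name and total memory of the first GPU, useful when picking a batch size.
    props = torch.cuda.get_device_properties(0)
    print(props.name, f"{props.total_memory / 1e9:.1f} GB")
```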
To train the model, use the `train_model.py` script. You can customize various parameters:

`python train_model.py [OPTIONS]`
Options:

- `--model`: Type of model to use (bert, roberta, xlm-roberta, or bertimbau). Default is "bertimbau".
- `--epochs`: Number of training epochs. Default is 5.
- `--learning_rate`: Learning rate. Default is 2e-5.
- `--batch_size`: Batch size. Default is 32.
- `--data_path`: Path to the training and validation data CSV file. Default is "dataset/train_val_data.csv".
- `--output_path`: Path to save the trained model. If not provided, a default name will be used.
Example:

`python train_model.py --model bertimbau --epochs 5 --learning_rate 2e-5 --batch_size 32`
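The script's internals are not reproduced in this README; the sketch below shows the standard fine-tuning recipe such a script typically follows, assuming the Hugging Face `transformers` library and the public BERTimbau checkpoint `neuralmind/bert-base-portuguese-cased`. The repository's actual code may organize this differently:

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed checkpoint for the "bertimbau" option; the repo may use a different mapping.
checkpoint = "neuralmind/bert-base-portuguese-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2).to(device)

# train_val_data.csv provides 'text' and 'label' columns (see the dataset section).
df = pd.read_csv("dataset/train_val_data.csv")
enc = tokenizer(list(df["text"]), truncation=True, padding=True,
                max_length=128, return_tensors="pt")
labels = torch.tensor(df["label"].values)

loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], labels),
                    batch_size=32, shuffle=True)                      # --batch_size

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)            # --learning_rate
model.train()
for epoch in range(5):                                                 # --epochs
    for input_ids, attention_mask, batch_labels in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids.to(device),
                    attention_mask=attention_mask.to(device),
                    labels=batch_labels.to(device))
        out.loss.backward()   # cross-entropy loss from the classification head
        optimizer.step()

torch.save(model.state_dict(), "bertimbau_hatespeech_classifier.pth")  # --output_path
```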
To evaluate a trained model, use the `evaluate_model.py` script:

`python evaluate_model.py [OPTIONS]`
Options:

- `--model`: Type of model to evaluate (bert, roberta, xlm-roberta, or bertimbau). Default is "bertimbau".
- `--model_path`: Path to the trained model weights. Default is "bertimbau_hatespeech_classifier.pth".
- `--data_path`: Path to the testing data CSV file. Default is "dataset/test_data.csv".
- `--batch_size`: Batch size for evaluation. Default is 32.
Example:

`python evaluate_model.py --model bertimbau --model_path bertimbau_hatespeech_classifier.pth`
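Conceptually, evaluation reloads the saved weights and scores the held-out test set. Below is a minimal sketch under the same assumptions as the training sketch above; the repository's own script may differ in detail:

```python
import pandas as pd
import torch
from sklearn.metrics import accuracy_score, classification_report, f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint = "neuralmind/bert-base-portuguese-cased"  # assumed mapping for "bertimbau"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.load_state_dict(torch.load("bertimbau_hatespeech_classifier.pth", map_location=device))
model.to(device).eval()

df = pd.read_csv("dataset/test_data.csv")
preds = []
with torch.no_grad():
    for start in range(0, len(df), 32):                               # --batch_size
        batch = tokenizer(list(df["text"].iloc[start:start + 32]), truncation=True,
                          padding=True, max_length=128, return_tensors="pt").to(device)
        preds.extend(model(**batch).logits.argmax(dim=-1).cpu().tolist())

# Accuracy, F1 score, and a detailed per-class report, as listed later in this README.
print("Accuracy:", accuracy_score(df["label"], preds))
print("F1 score:", f1_score(df["label"], preds))
print(classification_report(df["label"], preds))
```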
The project supports the following models (a possible checkpoint mapping is sketched after the list):
- BERT
- RoBERTa
- XLM-RoBERTa
- BERTimbau (Portuguese BERT)
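How these names map to pretrained checkpoints is defined in the repository's model code; the dictionary below is only an assumption based on the usual Hugging Face identifiers for each family (BERTimbau's published checkpoint is `neuralmind/bert-base-portuguese-cased`):

```python
# Assumed --model -> checkpoint mapping; check model/ for the names actually used.
MODEL_CHECKPOINTS = {
    "bert": "bert-base-multilingual-cased",                # multilingual BERT
    "roberta": "roberta-base",                             # English RoBERTa
    "xlm-roberta": "xlm-roberta-base",                     # multilingual XLM-RoBERTa
    "bertimbau": "neuralmind/bert-base-portuguese-cased",  # Portuguese BERT
}
```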
This project uses two publicly available datasets for Portuguese hate speech detection:
- OffComBR Dataset
  - Source: OffComBR GitHub Repository
  - Description: A dataset of offensive comments in Brazilian Portuguese, collected from news websites and social media.
  - Files used: `OffComBR2.arff`, `OffComBR3.arff`
- HateBR Dataset
  - Source: HateBR GitHub Repository
  - Description: A large-scale dataset for hate speech detection in Brazilian Portuguese, collected from Instagram.
  - File used: `HateBR.csv`
These datasets are combined and preprocessed for use in this project. Please refer to the original repositories for more information about the datasets, including their collection methodologies, annotations, and usage terms.
Note: Ensure you comply with the usage terms and provide appropriate attribution when using these datasets.
The datasets are preprocessed and combined into two CSV files:
- `dataset/train_val_data.csv`: For training and validation (80% of the combined data)
- `dataset/test_data.csv`: For final evaluation (20% of the combined data)
Each CSV contains 'text' and 'label' columns. The preprocessing steps, sketched after this list, include:
- Loading and merging the datasets
- Splitting into train+validation and test sets
- Saving as CSV files for easy loading during training and evaluation
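A minimal sketch of the split-and-save step, assuming `pandas` and `scikit-learn`; the placeholder rows stand in for the merged OffComBR and HateBR data, whose actual loading from the `.arff` and `.csv` sources is handled by the repository's own code:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder rows standing in for the merged OffComBR + HateBR data
# (label 1 = offensive/hate, 0 = non-offensive is assumed here).
combined = pd.DataFrame({
    "text": ["comentário ofensivo 1", "comentário neutro 1",
             "comentário ofensivo 2", "comentário neutro 2", "comentário neutro 3"],
    "label": [1, 0, 1, 0, 0],
})

# 80% train+validation / 20% test, written out as the two CSVs used by the scripts.
train_val, test = train_test_split(combined, test_size=0.2, random_state=42)
train_val.to_csv("dataset/train_val_data.csv", index=False)
test.to_csv("dataset/test_data.csv", index=False)
```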
After training, the model will be saved in the specified output path or with a default name based on the model type.
Evaluation results will display accuracy, F1 score, and a detailed classification report.
- Ensure you have sufficient GPU resources for training larger models.
- Adjust batch size based on your GPU memory capacity.
- For best results with Portuguese text, the BERTimbau model is recommended.
We would like to thank the creators and contributors of the OffComBR and HateBR datasets for making their data publicly available for research purposes.