Note
With the latest update, the pipeline supports high-level training control through a YAML file, so you no longer need to modify source code except when adding a new dataset. For a new dataset, you must still convert it to the required format. In all other cases, edit the configuration file (train_config.yaml) to make the necessary adjustments.
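As a rough illustration of the dataset conversion step, the sketch below writes token-level NER records to a .csv file using only the Python standard library. It is not the logic of construct_text_data.py: the column names (`document`, `tokens`, `labels`) and the helper names `records_to_csv` and `read_csv_records` are assumptions made for this example, so check the script for the columns the pipeline actually expects.

```python
import csv
import io
import json

# Hypothetical column names; the real ones are defined by construct_text_data.py.
CSV_COLUMNS = ["document", "tokens", "labels"]


def records_to_csv(records, out_file):
    """Write one document per row; token and label lists are JSON-encoded."""
    writer = csv.DictWriter(out_file, fieldnames=CSV_COLUMNS)
    writer.writeheader()
    for rec in records:
        # Every token needs exactly one label, otherwise training data is misaligned.
        if len(rec["tokens"]) != len(rec["labels"]):
            raise ValueError(f"document {rec['document']}: token/label length mismatch")
        writer.writerow({
            "document": rec["document"],
            "tokens": json.dumps(rec["tokens"]),
            "labels": json.dumps(rec["labels"]),
        })


def read_csv_records(in_file):
    """Read the rows back and decode the JSON-encoded lists."""
    return [
        {
            "document": row["document"],
            "tokens": json.loads(row["tokens"]),
            "labels": json.loads(row["labels"]),
        }
        for row in csv.DictReader(in_file)
    ]


# Usage with an in-memory buffer; a real run would open a file with newline="".
sample = [
    {
        "document": "7",
        "tokens": ["My", "name", "is", "Jane", "Doe", "."],
        "labels": ["O", "O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O"],
    }
]
buffer = io.StringIO()
records_to_csv(sample, buffer)
round_trip = read_csv_records(io.StringIO(buffer.getvalue()))
```

JSON-encoding the lists keeps tokens that contain commas or quotes intact across the CSV round trip.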
- Accelerator: NVIDIA RTX 4090D $\times$ 2
- Platform: Linux
- Internet: Enabled
- LLM: Qwen/Qwen2-1.5B-Instruct
- Dataset: The Learning Agency Lab - PII Data Detection
- Utils: transformers | trl
Important
Key modules are implemented in the qwen2ner module. For more technical details, please refer to the module.
- Download the dataset from Kaggle to the dataset folder.
- Construct the .csv format dataset.
python3 construct_text_data.py
- Train the model.
./train.sh
- Run inference on a single text.
python3 inference.py \
--model_name_or_path MODEL_NAME_OR_PATH
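
The PII Data Detection labels follow the BIO scheme (for example B-NAME_STUDENT, I-NAME_STUDENT, O), so token-level predictions usually need to be grouped into entity spans before they are useful. How inference.py formats its output is not described here; the following is a minimal sketch of that grouping step, and the function name `group_bio_spans` is hypothetical.

```python
def group_bio_spans(labels):
    """Group BIO labels into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    current_type, start = None, None
    for i, label in enumerate(labels):
        # "B-EMAIL" -> ("B", "EMAIL"); "O" -> ("O", "")
        prefix, _, ent_type = label.partition("-")
        continues = prefix == "I" and ent_type == current_type
        # Close the open span when the current label does not continue it.
        if current_type is not None and not continues:
            spans.append((current_type, start, i))
            current_type, start = None, None
        # Open a new span on B-, or on a stray I- with no open span (lenient).
        if prefix == "B" or (prefix == "I" and current_type is None):
            current_type, start = ent_type, i
    if current_type is not None:
        spans.append((current_type, start, len(labels)))
    return spans


# Usage: map the spans back to the tokens they cover.
tokens = ["Contact", "Jane", "Doe", "at", "jane@example.com"]
labels = ["O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O", "B-EMAIL"]
spans = group_bio_spans(labels)
entities = [(ent_type, " ".join(tokens[s:e])) for ent_type, s, e in spans]
```

Treating a stray I- tag as the start of a new span is a design choice for this sketch; a stricter variant could discard such tags instead.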