This repository provides tools for generating datasets for training and tuning Large Language Models (LLMs).
Datasets are divided into three main categories: instructions, conversations, and functions.
Each category consists of the following types of content:
- automated: datasets generated entirely by scripts; downloading the source data, generating the datasets, and saving them are all handled by the scripts.
- manual: only part of the dataset generation process is automated, and further human intervention is required to complete the datasets.
- samples: examples of the generated datasets (up to 3 records each).
Each category (instructions, conversations, functions) has its own directory, containing subdirectories for automated, manual, and sample datasets. Inside each subdirectory, you will find examples and explanations of how each type of dataset should be structured.
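For example, the instructions category might be laid out as follows (the subdirectory names follow the list above; the file name is illustrative):

```
instructions/
├── automated/   # fully script-generated datasets
│   └── allegro-summarization.py
├── manual/      # partially automated, human-completed datasets
└── samples/     # up to 3 example records per dataset
```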
Instructions:
Released instructions version: 2024_03_07_v0_0_13 (download links):
- All generated instructions in one JSONL file: `speakleash_pl_instructions_2024_03_07_v0_0_13.jsonl`
- All generated instructions in one JSONL file (Alpaca format): `speakleash_pl_instructions_alpaca_2024_03_07_v0_0_13.jsonl`
- All generated instructions in one parquet file (Alpaca format): `speakleash_pl_instructions_alpaca_2024_03_07_v0_0_13.parquet`
- All generated instructions JSON files packed into one zip file: `instructions_not_merged_2024_03_07_v0_0_13.zip`
Or download using terminal commands:
- For Linux: `wget https://d6t0.c15.e2-2.dev/speakleash-instructions-pub/speakleash_pl_instructions_2024_03_07_v0_0_13.jsonl`
- For Windows: `curl -O https://d6t0.c15.e2-2.dev/speakleash-instructions-pub/speakleash_pl_instructions_2024_03_07_v0_0_13.jsonl`
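If you prefer Python, the following minimal sketch downloads the Alpaca-format release and inspects its first record (the `instruction`/`input`/`output` field names are the usual Alpaca convention, assumed here rather than confirmed against this release):

```python
import json
import urllib.request

URL = ("https://d6t0.c15.e2-2.dev/speakleash-instructions-pub/"
       "speakleash_pl_instructions_alpaca_2024_03_07_v0_0_13.jsonl")

# Download the Alpaca-format JSONL file to the current directory.
urllib.request.urlretrieve(URL, "instructions_alpaca.jsonl")

# Each line of a JSONL file is one standalone JSON record.
with open("instructions_alpaca.jsonl", encoding="utf-8") as f:
    first_record = json.loads(f.readline())

# The Alpaca convention uses "instruction", "input", and "output" fields;
# print the keys to confirm what this release actually contains.
print(list(first_record.keys()))
```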
To contribute, clone this repository and add a new script (e.g., `allegro-summarization.py`) to the chosen directory (instructions, conversations, functions); a skeleton of such a script is sketched below. If you identify additional types of training datasets that should be included, please contribute by creating an issue in the repository. New sections covering the proposed types will be added based on feedback and discussion.
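A minimal sketch of what such a generation script might look like (the field names, output path, and example data are illustrative assumptions, not the repository's actual conventions):

```python
import json
import os

OUTPUT_DIR = "output"  # generated instruction files land here (see below)

def generate_instructions():
    """Return a list of instruction records built from some source data."""
    # Hard-coded example record; a real script would download and
    # transform an external dataset instead. The field names here are
    # an assumption, not the repository's enforced schema.
    return [
        {
            "instruction": "Summarize the product description below.",
            "input": "Some Polish source text...",
            "output": "A short summary...",
        }
    ]

if __name__ == "__main__":
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    out_path = os.path.join(OUTPUT_DIR, "allegro_summarization.json")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(generate_instructions(), f, ensure_ascii=False, indent=2)
```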
A simple instructions manual (in Polish) is also available.
Internal datasets from the `speakleash` package are downloaded separately to the `data_speakleash` directory. This temporary arrangement is necessary with the current version of the `speakleash` package: the manifest files are downloaded automatically to the same directory as the datasets, so the two directories were separated for better readability. This behavior is intentional, but we are working on changes, as described in this issue.
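For reference, replicating a SpeakLeash dataset into that directory looks roughly like this (based on the public `speakleash` package API; the dataset name is illustrative):

```python
import os
from speakleash import Speakleash

# Replicate SpeakLeash datasets (and their manifest files) locally.
replicate_dir = os.path.join(os.getcwd(), "data_speakleash")
sl = Speakleash(replicate_dir)

# "forums_pl" is an illustrative name; inspect sl.datasets for real ones.
dataset = sl.get("forums_pl")
for text in dataset.data:  # yields the plain text of each document
    print(text[:100])
    break
```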
Instruction files are generated in the `output` directory. External datasets are downloaded to the `data` directory.
To generate one final instructions JSON file, merge the generated files using the `merge_files.py` script. The merged file will be created in the `instructions_merged_and_stats` directory along with statistical files describing the instructions data.
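Conceptually, the merge step could look like the sketch below (the assumption that each generated file holds a JSON list of records, and the per-source record count as the example statistic, are both illustrative; see `merge_files.py` for the real logic):

```python
import glob
import json
import os

OUTPUT_DIR = "output"                         # per-script instruction files
MERGED_DIR = "instructions_merged_and_stats"  # merged file + statistics

os.makedirs(MERGED_DIR, exist_ok=True)
stats = {}

with open(os.path.join(MERGED_DIR, "instructions_merged.jsonl"),
          "w", encoding="utf-8") as merged:
    for path in sorted(glob.glob(os.path.join(OUTPUT_DIR, "*.json"))):
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        for record in records:
            merged.write(json.dumps(record, ensure_ascii=False) + "\n")
        # Track a simple per-source record count as an example statistic.
        stats[os.path.basename(path)] = len(records)

with open(os.path.join(MERGED_DIR, "stats.json"), "w", encoding="utf-8") as f:
    json.dump(stats, f, indent=2)
```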
To update instruction samples, run the `generate_samples.py` script. It will generate JSON files with three records each.
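Presumably the sampling step just truncates each generated file to its first three records; a sketch under that assumption (the `samples` output directory name is a guess):

```python
import glob
import json
import os

SAMPLES_DIR = "samples"
os.makedirs(SAMPLES_DIR, exist_ok=True)

for path in glob.glob(os.path.join("output", "*.json")):
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    # Keep only the first three records as a sample.
    sample_path = os.path.join(SAMPLES_DIR, os.path.basename(path))
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(records[:3], f, ensure_ascii=False, indent=2)
```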
Notes on specific scripts:
- `sentiment_detection.py` -> requires a HuggingFace token.
- `orca_math_create_english_docx.py` with `orca_math_create_json_from_docx.py` -> the generated files need to be translated by the user in an external service, so these scripts are not included in `merge_files.py`. More information inside these scripts.
- `speakleash_forums_questions.py` -> if the installed requirements don't work, follow the steps included in the StyloMetrix documentation.
- If you are facing problems with dependencies, manually install the following libraries:
```
pip install http://mozart.ipipan.waw.pl/~rtuora/spacy/pl_nask-0.0.7.tar.gz
pip install https://github.com/explosion/spacy-models/releases/download/pl_core_news_md-3.7.0/pl_core_news_md-3.7.0-py3-none-any.whl
```

This is a temporary solution, but it works.