This repository provides tools for generating datasets for training and tuning Large Language Models (LLMs).
Datasets are divided into three main categories: instructions, conversations, and functions.
Each category consists of the following types of content:
- automated: datasets generated entirely by scripts; downloading the source data, generating the datasets, and saving them are all handled by the scripts.
- manual: only part of the dataset generation process is automated, and further human intervention is required to complete the datasets.
- samples: examples of the generated datasets (up to 3 records each).
Each category (instructions, conversations, functions) has its own directory, containing subdirectories for automated, manual, and sample datasets. Inside each subdirectory, you will find examples and explanations of how each type of dataset should be structured.
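For example, the instructions category might be laid out as follows (the subdirectory names follow the list above; the file name is illustrative):

```
instructions/
├── automated/   # fully script-generated datasets
│   └── allegro-summarization.py
├── manual/      # partially automated, human-completed datasets
└── samples/     # up to 3 example records per dataset
```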
Instructions:
Released instructions version: 2024_03_07_v0_0_13 (download links):
- All generated instructions in one JSONL file: `speakleash_pl_instructions_2024_03_07_v0_0_13.jsonl`
- All generated instructions in one JSONL file (Alpaca format): `speakleash_pl_instructions_alpaca_2024_03_07_v0_0_13.jsonl`
- All generated instructions in one parquet file (Alpaca format): `speakleash_pl_instructions_alpaca_2024_03_07_v0_0_13.parquet`
- All generated instructions JSON files packed into one zip file: `instructions_not_merged_2024_03_07_v0_0_13.zip`
Or download using terminal commands:
- For Linux: `wget https://d6t0.c15.e2-2.dev/speakleash-instructions-pub/speakleash_pl_instructions_2024_03_07_v0_0_13.jsonl`
- For Windows: `curl -O https://d6t0.c15.e2-2.dev/speakleash-instructions-pub/speakleash_pl_instructions_2024_03_07_v0_0_13.jsonl`
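If you prefer Python, the following minimal sketch downloads the Alpaca-format release and inspects its first record (the `instruction`/`input`/`output` field names are the usual Alpaca convention, assumed here rather than confirmed against this release):

```python
import json
import urllib.request

URL = ("https://d6t0.c15.e2-2.dev/speakleash-instructions-pub/"
       "speakleash_pl_instructions_alpaca_2024_03_07_v0_0_13.jsonl")

# Download the Alpaca-format JSONL file to the current directory.
urllib.request.urlretrieve(URL, "instructions_alpaca.jsonl")

# Each line of a JSONL file is one standalone JSON record.
with open("instructions_alpaca.jsonl", encoding="utf-8") as f:
    first_record = json.loads(f.readline())

# The Alpaca convention uses "instruction", "input", and "output" fields;
# print the keys to confirm what this release actually contains.
print(list(first_record.keys()))
```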
To contribute, clone this repository and add a new script (e.g., `allegro-summarization.py`) to the chosen directory (instructions, conversations, functions); a skeleton of such a script is sketched below. If you identify additional types of training datasets that should be included, please contribute by creating an issue in the repository. New sections covering the proposed types will be added based on feedback and discussion.
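A minimal sketch of what such a generation script might look like (the field names, output path, and example data are illustrative assumptions, not the repository's actual conventions):

```python
import json
import os

OUTPUT_DIR = "output"  # generated instruction files land here (see below)

def generate_instructions():
    """Return a list of instruction records built from some source data."""
    # Hard-coded example record; a real script would download and
    # transform an external dataset instead. The field names here are
    # an assumption, not the repository's enforced schema.
    return [
        {
            "instruction": "Summarize the product description below.",
            "input": "Some Polish source text...",
            "output": "A short summary...",
        }
    ]

if __name__ == "__main__":
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    out_path = os.path.join(OUTPUT_DIR, "allegro_summarization.json")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(generate_instructions(), f, ensure_ascii=False, indent=2)
```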
A simple instructions manual (in Polish) is also available.
Internal datasets from the `speakleash` package are downloaded separately to the `data_speakleash` directory. This temporary arrangement is necessary with the current version of the `speakleash` package: the manifest files are downloaded automatically to the same directory as the datasets, so the two directories were separated for better readability. This behavior is intentional, but we are working on changes, as described in this issue.
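For reference, replicating a SpeakLeash dataset into that directory looks roughly like this (based on the public `speakleash` package API; the dataset name is illustrative):

```python
import os
from speakleash import Speakleash

# Replicate SpeakLeash datasets (and their manifest files) locally.
replicate_dir = os.path.join(os.getcwd(), "data_speakleash")
sl = Speakleash(replicate_dir)

# "forums_pl" is an illustrative name; inspect sl.datasets for real ones.
dataset = sl.get("forums_pl")
for text in dataset.data:  # yields the plain text of each document
    print(text[:100])
    break
```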
Instruction files are generated in the `output` directory. External datasets are downloaded to the `data` directory.
To generate one final instructions JSON file, merge the generated files using the `merge_files.py` script. The merged file will be created in the `instructions_merged_and_stats` directory along with statistical files describing the instructions data.
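Conceptually, the merge step could look like the sketch below (the assumption that each generated file holds a JSON list of records, and the per-source record count as the example statistic, are both illustrative; see `merge_files.py` for the real logic):

```python
import glob
import json
import os

OUTPUT_DIR = "output"                         # per-script instruction files
MERGED_DIR = "instructions_merged_and_stats"  # merged file + statistics

os.makedirs(MERGED_DIR, exist_ok=True)
stats = {}

with open(os.path.join(MERGED_DIR, "instructions_merged.jsonl"),
          "w", encoding="utf-8") as merged:
    for path in sorted(glob.glob(os.path.join(OUTPUT_DIR, "*.json"))):
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        for record in records:
            merged.write(json.dumps(record, ensure_ascii=False) + "\n")
        # Track a simple per-source record count as an example statistic.
        stats[os.path.basename(path)] = len(records)

with open(os.path.join(MERGED_DIR, "stats.json"), "w", encoding="utf-8") as f:
    json.dump(stats, f, indent=2)
```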
To update instruction samples, run the `generate_samples.py` script. It will generate JSON files with three records each.
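Presumably the sampling step just truncates each generated file to its first three records; a sketch under that assumption (the `samples` output directory name is a guess):

```python
import glob
import json
import os

SAMPLES_DIR = "samples"
os.makedirs(SAMPLES_DIR, exist_ok=True)

for path in glob.glob(os.path.join("output", "*.json")):
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    # Keep only the first three records as a sample.
    sample_path = os.path.join(SAMPLES_DIR, os.path.basename(path))
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(records[:3], f, ensure_ascii=False, indent=2)
```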
Notes on specific scripts:
- `sentiment_detection.py` -> requires a HuggingFace token.
- `orca_math_create_english_docx.py` with `orca_math_create_json_from_docx.py` -> the generated files need to be translated by the user in an external service, so these scripts are not included in `merge_files.py`. More information inside these scripts.
- `speakleash_forums_questions.py` -> if the installed requirements don't work, follow the steps included in the StyloMetrix documentation.
- If you are facing problems with dependencies, manually install the following libraries:
```
pip install http://mozart.ipipan.waw.pl/~rtuora/spacy/pl_nask-0.0.7.tar.gz
pip install https://github.com/explosion/spacy-models/releases/download/pl_core_news_md-3.7.0/pl_core_news_md-3.7.0-py3-none-any.whl
```

This is a temporary solution, but it works.