Speakleash Training and Tuning Datasets

This repository provides tools for generating datasets for training and tuning Large Language Models (LLMs).

Overview

Datasets are divided into three main categories:

  • instructions
  • conversations
  • functions

Each category consists of the following types of content:

  • automated
  • manual
  • samples

Content types description:

  • automated

Datasets are fully generated by scripts. The process of downloading data, generating datasets, and saving them is entirely handled by scripts.

  • manual

Only part of the dataset generation process is automated. Further human intervention is required to complete the datasets.

  • samples

Examples of the generated datasets (up to 3 records).

Usage

Each category (instructions, conversations, functions) has its own directory, containing subdirectories for automated, manual, and sample datasets. Inside each subdirectory, you will find examples and explanations of how each type of dataset should be structured.

Generated dataset files:

Instructions:

Released instructions version: 2024_03_07_v0_0_13 (expandable list with download links):

All generated instructions in one JSONL file:
speakleash_pl_instructions_2024_03_07_v0_0_13.jsonl

All generated instructions in one JSONL file (Alpaca format):
speakleash_pl_instructions_alpaca_2024_03_07_v0_0_13.jsonl

All generated instructions in one parquet file (Alpaca format):
speakleash_pl_instructions_alpaca_2024_03_07_v0_0_13.parquet

All generated instructions JSON files packed into one zip file:
instructions_not_merged_2024_03_07_v0_0_13.zip

Or using terminal commands:

  • For Linux:
    wget https://d6t0.c15.e2-2.dev/speakleash-instructions-pub/speakleash_pl_instructions_2024_03_07_v0_0_13.jsonl

  • For Windows:
    curl -O https://d6t0.c15.e2-2.dev/speakleash-instructions-pub/speakleash_pl_instructions_2024_03_07_v0_0_13.jsonl
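
Once downloaded, the JSONL file can be read line by line. The following is a minimal Python sketch, assuming each non-empty line holds one JSON object. The helper names `parse_jsonl` and `load_jsonl` are illustrative and are not part of this repository.

```python
import json
from pathlib import Path


def parse_jsonl(lines):
    """Parse an iterable of text lines, one JSON object per non-empty line."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


def load_jsonl(path):
    """Read a JSONL file from disk and return its records as a list."""
    with Path(path).open(encoding="utf-8") as handle:
        return parse_jsonl(handle)


# Example call, assuming the file from the download command is in the
# current directory:
# data = load_jsonl("speakleash_pl_instructions_2024_03_07_v0_0_13.jsonl")
# print(len(data))
```

Reading with an explicit UTF-8 encoding keeps Polish characters intact regardless of the platform default.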

Contribution

To contribute, clone this repository and add a new script (e.g., allegro-summarization.py) to the chosen directory (instructions, conversations, functions). If you identify additional types of training datasets that should be included, please create an issue in the repository. New sections covering the proposed types will be added based on feedback and discussion.

Further details on how to create instructions:

A simple instructions manual (in Polish)

Working with code:

Datasets

Internal datasets from the Speakleash package are downloaded separately to the data_speakleash directory. This is a temporary solution required by the current version of the Speakleash package: manifest files are downloaded automatically into the same directory as the datasets, so the internal datasets are kept in their own directory for better readability. This behavior is intentional, but we are working on some changes, described in this issue.

Workflow directories

Instruction files are generated in the output directory. External datasets are downloaded to the data directory.

Output

To generate one final instructions JSON file, merge the generated files using the merge_files.py script. The merged file is created in the instructions_merged_and_stats directory, along with statistics files describing the instruction data. To update the instruction samples, run the generate_samples.py script, which generates JSON files with three records each.
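
The merge and sampling steps can be pictured roughly as follows. This is only an illustrative sketch, not the actual logic of merge_files.py or generate_samples.py: it assumes each file in the output directory is a JSON array of records, and all function names are hypothetical.

```python
import json
from pathlib import Path


def merge_records(record_lists):
    """Concatenate several lists of records into one list, preserving order."""
    merged = []
    for records in record_lists:
        merged.extend(records)
    return merged


def take_sample(records, size=3):
    """Return the first `size` records (three by default, as in the samples)."""
    return records[:size]


def merge_json_dir(directory):
    """Load every *.json file in `directory` (sorted by name) and merge them.

    Assumption: each file contains a JSON array of records.
    """
    paths = sorted(Path(directory).glob("*.json"))
    return merge_records(
        json.loads(path.read_text(encoding="utf-8")) for path in paths
    )
```

For example, `take_sample(merge_json_dir("output"))` would return the first three records found across the files in the output directory.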

Important Information

  • sentiment_detection.py -> requires a Hugging Face token.
  • orca_math_create_english_docx.py with orca_math_create_json_from_docx.py -> these scripts depend on a translation step that you must perform yourself in an external service, so they are not included in merge_files.py. More information can be found inside the scripts.
  • speakleash_forums_questions.py -> if the installed requirements do not work, follow the steps in this documentation: StyloMetrix
  • If you are facing problems with dependencies, install the following libraries manually:
    pip install http://mozart.ipipan.waw.pl/~rtuora/spacy/pl_nask-0.0.7.tar.gz
    pip install https://github.com/explosion/spacy-models/releases/download/pl_core_news_md-3.7.0/pl_core_news_md-3.7.0-py3-none-any.whl
    This is a temporary solution, but it works.
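
Before running the scripts, it can help to check whether the Polish spaCy models are importable. The following is a minimal sketch, assuming the two packages from the pip commands above install importable modules named pl_nask and pl_core_news_md. The helper `missing_modules` is hypothetical and not part of this repository.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the names from `names` that cannot be found as top-level modules."""
    return [name for name in names if find_spec(name) is None]


# Module names are assumed to match the package names in the pip commands.
missing = missing_modules(["pl_nask", "pl_core_news_md"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required spaCy models are installed.")
```

`find_spec` only looks the module up without importing it, so the check stays fast even for large models.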