LLM-jp Corpus

This repository contains scripts to reproduce the LLM-jp corpus.

Number of Tokens

The scripts directory contains the scripts to download, filter, and tokenize the data.
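As a rough illustration of the filter-then-tokenize flow mentioned above, here is a minimal sketch. It is not taken from this repository's scripts: the function names, the `min_chars` threshold, and the whitespace tokenizer are all assumptions made for the example. A real pipeline for Japanese text would need a proper tokenizer, since whitespace splitting does not segment Japanese.

```python
from typing import Iterable, Iterator, List


def filter_documents(docs: Iterable[str], min_chars: int = 10) -> Iterator[str]:
    """Yield stripped documents with at least min_chars characters.

    The length threshold is a hypothetical filtering rule, used only
    to show where a filtering step sits in the flow.
    """
    for doc in docs:
        text = doc.strip()
        if len(text) >= min_chars:
            yield text


def tokenize(text: str) -> List[str]:
    """Placeholder tokenizer: split on whitespace (an assumption)."""
    return text.split()


def count_tokens(docs: Iterable[str]) -> int:
    """Filter the documents, tokenize the survivors, and sum the token counts."""
    return sum(len(tokenize(doc)) for doc in filter_documents(docs))


sample_docs = [
    "short",                               # dropped by the length filter
    "This is a longer example document.",  # kept
    "  Another usable line of text.  ",    # kept after stripping
]
total_tokens = count_tokens(sample_docs)
print(total_tokens)
```

Keeping filtering and tokenization as separate functions makes it easy to swap in a different tokenizer or filtering rule without touching the counting step.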

The code in this repository is licensed under the Apache 2.0 license.

As for the dataset itself, refer to the licenses of the data subsets:
