How to look into the processed data? #266

Hi,

After running tokenize_from_hf_to_s3.py, I would like to inspect the resulting data, but I find that it is stored in a binary file (.ds). Is there a way to look into the data? Thanks!

Comments
The following works for me:

```python
import numpy as np
from datatrove.pipeline.tokens.merger import load_doc_ends, get_data_reader
from transformers import GPT2Tokenizer

def read_tokenized_data(data_file):
    # The .index file holds the end position of each document in the token stream.
    with open(f"{data_file}.index", 'rb') as f:
        doc_ends = load_doc_ends(f)
    # nb_bytes=2 matches uint16 tokens (vocab size below 65536, as with GPT-2).
    reader = get_data_reader(open(data_file, 'rb'), doc_ends, nb_bytes=2)
    decode = lambda x: np.frombuffer(x, dtype=np.uint16).astype(int)
    return map(decode, reader)

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

data_file = 'test/000_test.ds'
# Decode and print the first five documents.
for i, input_ids in enumerate(read_tokenized_data(data_file)):
    if i == 5:
        break
    print(len(input_ids))
    print(tokenizer.decode(input_ids))
    print('\n-------------------\n')
```
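If you would rather not touch datatrove internals, the same data can presumably be read with plain numpy. This is a minimal sketch, assuming the layout implied by the snippet above: tokens stored as uint16, with a companion .index file that is a flat array of uint64 document-end offsets counted in tokens (these format details are my assumption, not confirmed in this thread):

```python
# Sketch: inspect a .ds file with numpy only.
# Assumptions (unconfirmed): uint16 tokens; <file>.index is a flat
# uint64 array of document-end offsets, measured in tokens.
import numpy as np

data_file = 'test/000_test.ds'  # same example path as above
doc_ends = np.fromfile(f"{data_file}.index", dtype=np.uint64).astype(np.int64)
tokens = np.fromfile(data_file, dtype=np.uint16)

start = 0
for end in doc_ends[:5]:  # first five documents
    print(int(end - start), "tokens, starting with:", tokens[start:start + 10])
    start = end
```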
Alternatively, you could use DatatroveFileDataset:

```python
from datatrove.utils.dataset import DatatroveFileDataset
from transformers import GPT2Tokenizer

path = 'test/test_tokenized_00000_00000_shuffled.ds'
dataset = DatatroveFileDataset(
    file_path=path,
    seq_len=2048,
    token_size=2,  # 2 bytes per token (uint16)
)

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Each item is a dict with an 'input_ids' tensor; decode the first one.
for batch in dataset:
    input_ids = batch['input_ids'].numpy()
    print(tokenizer.decode(input_ids))
    break
```
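For batched inspection, the dataset can likely be wrapped in a standard PyTorch DataLoader. A sketch, assuming DatatroveFileDataset behaves like an ordinary torch Dataset yielding dicts of tensors (as the loop above suggests); the batch size here is arbitrary:

```python
# Sketch: batched decoding via torch's DataLoader.
# Reuses `dataset` and `tokenizer` from the snippet above.
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=4)
batch = next(iter(loader))  # dict with 'input_ids' of shape (batch, seq len)
for row in batch['input_ids']:
    print(tokenizer.decode(row.numpy()))
    print('\n-------------------\n')
```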
Thank you so much! I will give it a try.