Fix packing script w Tiktokenizer on Mac and Windows #697

Closed
wants to merge 2 commits into from
10 changes: 10 additions & 0 deletions llmfoundry/data/packing.py
@@ -261,6 +261,8 @@ def pad_tensor(tensor: torch.Tensor, pad_value: int):


if __name__ == '__main__':
    import multiprocessing as mp
    import platform
    from argparse import ArgumentParser, Namespace

    from omegaconf import OmegaConf as om
@@ -270,6 +272,14 @@ def pad_tensor(tensor: torch.Tensor, pad_value: int):
    from llmfoundry.data import build_text_dataloader
    from llmfoundry.utils import build_tokenizer


    if platform.system() != 'Linux':
        # The default start method is 'fork' on Linux, but 'spawn' on macOS
        # and Windows. When a child process is created with 'fork', objects
        # are inherited from the parent instead of pickled/unpickled, whereas
        # with 'spawn' the worker arguments are sent to the child through
        # pickling/unpickling. This is a problem for the tiktoken tokenizer,
        # which is not picklable.
        mp.set_start_method('fork', force=True)

    def parse_args() -> Namespace:
        """Parse commandline arguments."""
        parser = ArgumentParser(
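The fix above hinges on the difference between the 'fork' and 'spawn' start methods: under 'spawn', worker arguments must survive a pickle round trip, which a tiktoken tokenizer cannot. A minimal, platform-independent sketch of the failure mode, where `Unpicklable` is a hypothetical stand-in for the tokenizer's unpicklable internals:

```python
import pickle


class Unpicklable:
    """Hypothetical stand-in for an object, like a tiktoken tokenizer,
    that cannot be pickled."""

    def __reduce__(self):
        raise TypeError('cannot pickle this object')


def can_pickle(obj) -> bool:
    """Return True if obj survives a pickle round trip, i.e. whether it
    could be passed to a 'spawn'-started child process."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False


if __name__ == '__main__':
    # Under 'spawn', this object could not cross the process boundary;
    # under 'fork', the child inherits the parent's memory, so no
    # pickling is needed and the object is usable as-is.
    print(can_pickle(Unpicklable()))  # False
    print(can_pickle({'vocab': 123}))  # True: ordinary data pickles fine
```

This is why forcing 'fork' on macOS sidesteps the error without changing the tokenizer itself; note that 'fork' is not a substitute on platforms where it is unavailable.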