Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix packing script w Tiktokenizer on Mac and Windows #697

Closed
wants to merge 2 commits into from

Conversation

samhavens
Copy link
Contributor

We recently added tiktokenizer support, and it turns out this broke support for the packing script on Mac (and likely Windows, though that is untested). Why is explained in the comments.

I tested this change and it works on Mac and does not break Linux. I have not tested on Windows.

@dakinggg
Copy link
Collaborator

cc @irenedea since you're about to change all of this

@irenedea
Copy link
Contributor

irenedea commented Oct 27, 2023

@dakinggg @samhavens I think the proper solution here would be to make TiktokenTokenizerWrapper picklable.
#700 What do you think of this? If it looks good to you, I think this would be a more robust solution.

@samhavens
Copy link
Contributor Author

If the only reason it needs to be pickleable is because of the way multiprocessing works on Mac+Windows vs Linux, then I could go either way on which solution I prefer. But if there are any other cases where the tokenizer being pickleable is useful (storing state in some way) then I agree with you. Also, neither seems particularly harder and your PR is also ready, so I will close this PR.

@samhavens samhavens closed this Oct 27, 2023
@irenedea
Copy link
Contributor

@samhavens Sounds good! Multiprocessing is the main case where pickling is required (particularly in dataloader), but even so we'd need to handle this at every entry point in our repo not just in packing. Plus, HF tokenizers are all picklable, so I'd rather not break this assumption.

@dakinggg dakinggg deleted the sam/fix-packing-on-nonlinux branch November 17, 2023 06:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants