Use custom cache dir for tokenizer download, too #41

erickpeirson · 2024-11-07T14:41:03Z

Presently, passing cache_dir: Path to WordLlama.load() has no impact on the cache directory where tokenizer assets are stored. This makes it impossible to use WordLlama in an environment where the default cache path (the user's home directory) is not writable, which is often the case in production scenarios.

This PR does two things:

1. Modifies the meaning of the cache_dir parameter on the WordLlama.load() method to be the cache root directory, within which the tokenizers and weights subdirectories are created.
2. Ensures that cache_dir is passed to check_and_download_tokenizer and used, so that all writes occur within a configurable cache directory.

Note that this will effectively bust the cache on upgrade. But I'm hoping that's a small price to pay for the fix.
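
For readers who want to picture the resulting layout, here is a minimal sketch of the idea, not the code in this PR. The helper name resolve_cache_dirs and the fallback default location are hypothetical and chosen only for illustration; the tokenizers and weights subdirectory names come from the description above.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_cache_dirs(cache_dir: Optional[Path] = None) -> Tuple[Path, Path]:
    """Return (tokenizers_dir, weights_dir) under a single cache root.

    Hypothetical helper: it only illustrates treating cache_dir as the
    cache root, with both subdirectories created inside it.
    """
    if cache_dir is None:
        # Assumed default for this sketch; the library's real default
        # location is not stated in this PR.
        root = Path.home() / ".cache" / "wordllama"
    else:
        root = Path(cache_dir)

    tokenizers_dir = root / "tokenizers"
    weights_dir = root / "weights"

    # Every write lands under the configurable root, never elsewhere.
    for directory in (tokenizers_dir, weights_dir):
        directory.mkdir(parents=True, exist_ok=True)

    return tokenizers_dir, weights_dir
```

In an environment where the home directory is read-only, a caller would pass a writable root, for example resolve_cache_dirs(Path("/tmp/wordllama")), and both tokenizer assets and weights would then be stored beneath that path.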

dleemiller · 2024-11-07T23:06:11Z

Nice - definitely a necessary change for deploying to places like lambda functions. Thanks!

dleemiller · 2024-11-10T21:45:39Z

#42

I have decided to clean everything up and simplify the API by removing weights_dir as well. Having both keyword arguments now feels legacy and over-complicated to me.

erickpeirson added 2 commits November 7, 2024 12:13

parameterize check_and_download_tokenizer with cache_dir

72c45bc

update tests

61d49e4

dleemiller merged commit d8810b8 into dleemiller:main Nov 7, 2024
3 checks passed
