
Length masking for batch inputs #1173

Merged: 5 commits merged into main from cache-lengths on Dec 19, 2024

Conversation

barronalex (Collaborator)

Allow passing an array of lengths to make_prompt_cache.

create_attention_mask then does the correct masking when you call an mlx-lm model with a batch of padded, variable-length inputs together with the cache created above.

Very open to other ways of doing this, but I went with this because it allows us to leave the model code untouched.
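
For context, here is a rough sketch of the usage described above. Hedged: the interface evolved later in this thread, and the lengths keyword, model name, and padding value are illustrative assumptions, not code from the PR.

import mlx.core as mx
from mlx_lm import load
from mlx_lm.models.cache import make_prompt_cache

# Illustrative model; any mlx-lm model would work the same way.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Two prompts of different lengths, right-padded to a common width.
prompts = [tokenizer.encode(p) for p in ["Hello", "Hello there, how are you?"]]
lengths = mx.array([len(p) for p in prompts])
max_len = max(len(p) for p in prompts)
batch = mx.array([p + [0] * (max_len - len(p)) for p in prompts])

# Per-sequence lengths handed to the cache (the lengths kwarg is the interface
# described in this comment); create_attention_mask then masks the padding.
cache = make_prompt_cache(model, lengths=lengths)
logits = model(batch, cache=cache)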

awni (Member) commented Dec 18, 2024

> Very open to other ways of doing this, but I went with this because it allows us to leave the model code untouched.

I do wonder if we should revisit the design of the model interface to include an optional mask. For cases that require flexible masks it would be a lot more .. well .. flexible. Another benefit is that it is more functional and easier to compile / export if we decide to go that route.

Other than the tedious aspect of changing the model interface, how would that work for your use case?
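
A minimal sketch of the interface change being floated here, using a toy module rather than the real mlx-lm model code; the class, dimensions, and fallback mask construction are illustrative assumptions.

import mlx.core as mx
import mlx.nn as nn

class ToyModel(nn.Module):
    """Toy model whose forward pass accepts an optional, externally built mask."""

    def __init__(self, vocab=100, dims=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dims)
        self.attn = nn.MultiHeadAttention(dims, num_heads=2)
        self.out = nn.Linear(dims, vocab)

    def __call__(self, inputs, mask=None, cache=None):
        h = self.embed(inputs)
        if mask is None:
            # Fall back to the usual causal mask when the caller passes none,
            # so existing callers keep the old behavior (cache handling omitted).
            mask = nn.MultiHeadAttention.create_additive_causal_mask(h.shape[1])
            mask = mask.astype(h.dtype)
        h = self.attn(h, h, h, mask=mask)
        return self.out(h)

model = ToyModel()
x = mx.array([[1, 2, 3, 4]])
print(model(x).shape)  # (1, 4, 100)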

barronalex (Collaborator, Author)

That would work great too and would definitely be easier to follow in the code.

The only gotcha will be making sure that the mask you pass has the same dtype as the rest of the model so you don’t accidentally upcast.

awni (Member) commented Dec 18, 2024

> The only gotcha will be making sure that the mask you pass has the same dtype as the rest of the model so you don’t accidentally upcast.

It's a good point and it's easy to do. If we do go that route (and maybe either way), we should require the mask type to be the same as the keys/queries/values in the fast scaled_dot_product_attention implementation. Or we could require that result_type(mask.dtype(), queries.dtype()) == queries.dtype(), or something like that.
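
To make the upcasting pitfall concrete, a small sketch (shapes and values are arbitrary):

import mlx.core as mx

# A float32 mask added to bfloat16 scores silently promotes the whole result.
scores = mx.zeros((1, 8, 8), dtype=mx.bfloat16)
mask = mx.triu(mx.full((8, 8), -float("inf"), dtype=mx.float32), k=1)

print((scores + mask).dtype)                       # float32 -- the accidental upcast
print((scores + mask.astype(scores.dtype)).dtype)  # bfloat16 -- cast the mask first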

chimezie (Contributor)

+1 on "revisit our design of model interface to include an optional mask."

barronalex (Collaborator, Author)

I'll redraft this to add a mask input everywhere. I'll make a PR in the core repo too for the scaled_dot_product_attention change.

barronalex (Collaborator, Author)

OK all done.

I'm a little worried we're going to break anyone using model(x, cache) but model(x, mask, cache) is more consistent with all of the other places we use the mask.

awni (Member) commented Dec 18, 2024

Nice, thanks a ton for making that change!!

> anyone using model(x, cache) but model(x, mask, cache) is more consistent with all of the other places we use the mask.

It's an easy fix for them to make it a kwarg. Let's land it and see what happens.
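
Illustrative only: passing the cache by keyword insulates callers from the positional change. The toy functions below stand in for the old and new model signatures.

def old_model(x, cache=None):             # previous interface: model(x, cache)
    return x

def new_model(x, mask=None, cache=None):  # new interface: model(x, mask, cache)
    return x

x, cache = [1, 2, 3], {}
old_model(x, cache=cache)   # works today
new_model(x, cache=cache)   # still works after the change -- no positional breakage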

Review thread on llms/mlx_lm/models/base.py (outdated, resolved)
awni (Member) left a comment

Looks great, thanks!!

chimezie (Contributor)

So, just a question to help my understanding of the implications of this: you no longer need to 'remove' any prefix that was cached beforehand (as said here)?

awni (Member) commented Dec 19, 2024

> So, just a question to help my understanding of the implications of this: you no longer need to 'remove' any prefix that was cached beforehand

Right now this doesn't change any behavior in mlx-lm for training or fine-tuning. It's just some additional functionality that lets you specify a mask and some lengths to the mask creation function.

For cases like:

  • prefill prefix
  • generate with question A
  • trim cache to prefix length
  • generate with question B

You could use a mask instead, but in general I would not advise it since the cache will keep growing and you will be doing a whole bunch of wasted computation.
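
A sketch of that prefill-and-trim pattern with the mlx_lm cache helpers. Hedged: the model name is illustrative, and the trim_prompt_cache / prompt_cache usage is an assumption about those helpers rather than code from this PR.

import mlx.core as mx
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache, trim_prompt_cache

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
cache = make_prompt_cache(model)

# 1. Prefill the shared prefix once.
prefix = tokenizer.encode("You are a helpful assistant.\n")
model(mx.array(prefix)[None], cache=cache)

# 2. Generate with question A, reusing the prefilled cache.
generate(model, tokenizer, prompt="Question A?", prompt_cache=cache)

# 3. Trim the cache back to just the prefix...
trim_prompt_cache(cache, cache[0].offset - len(prefix))

# 4. ...and generate with question B from the same starting point.
generate(model, tokenizer, prompt="Question B?", prompt_cache=cache)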

barronalex merged commit d4ef909 into main on Dec 19, 2024
2 checks passed
barronalex deleted the cache-lengths branch on December 19, 2024 03:43
llllvvuu (Contributor)

This masks out the most recent tokens (right-padding), right?

import mlx.core as mx
from mlx_lm.models.base import create_causal_mask

print(create_causal_mask(3, 1, lengths=mx.array([1, 2, 3, 1])))
