Add custom ops for compatibility with PT Compile #1139

Merged: 13 commits into Dao-AILab:main on Sep 18, 2024

Conversation

@ani300
Contributor

ani300 commented Aug 8, 2024

This PR adds basic support for torch.compile() for the non-varlen variants of Flash Attention.

This essentially allows models that use flash_attn_qkvpacked_func, flash_attn_kvpacked_func, and flash_attn_func to compile without graph breaks.

I can add unit tests if that makes sense. I'll test it with our own training pipeline for performance measurements and post the results later.

This uses the new custom operators API in PyTorch 2.4. I can move to the older APIs if needed, or look into making both coexist.
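
For readers new to the PyTorch 2.4 custom operators API, here is a minimal sketch of the wrapping pattern (the op name "flash_attn_demo::fwd" and the SDPA stand-in kernel are illustrative, not this PR's actual code): the real implementation is registered for CUDA, and a separate fake implementation reports output shapes and dtypes so torch.compile can trace the call without stepping into the C++ extension.

import torch

# Illustrative only: the real op would call the compiled flash-attn CUDA kernel.
@torch.library.custom_op("flash_attn_demo::fwd", mutates_args=(), device_types="cuda")
def demo_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Stand-in for the pybind'd kernel call; must return a fresh tensor.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)

@torch.library.register_fake("flash_attn_demo::fwd")
def _(q, k, v):
    # The fake (meta) implementation only describes output metadata, which is
    # all torch.compile needs to trace through the op without a graph break.
    return torch.empty_like(q)

Registering the forward and backward calls this way is what lets the flash_attn_*_func wrappers compile end to end.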

@ani300
Contributor Author

ani300 commented Aug 8, 2024

Added varlen functions too

@tridao
Contributor

tridao commented Aug 8, 2024

Thank you! If this requires torch 2.4, is there some way to make it optional for older PyTorch versions? E.g., define torch.library.custom_op to be a no-op? I don't know what people usually do.

@ani300
Contributor Author

ani300 commented Aug 8, 2024

There's a different API that was used in previous PyTorch versions. It's more cumbersome to use, but let me think about how to use both APIs at the same time with a version selection. I can also do a no-op wrapper based on the version.

@ani300
Contributor Author

ani300 commented Aug 9, 2024

OK, I've updated the code to only wrap things if PyTorch is 2.4 or higher. Even though we're comparing strings, torch.__version__ implements semantic version comparisons, so it should always work correctly. As soon as I have the training performance comparison I'll update the PR.
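
For reference, a tiny check of that claim (not from the PR): torch.__version__ is a TorchVersion object, so comparing it against a plain string follows version semantics rather than lexicographic string order.

import torch

# TorchVersion overrides comparisons, so e.g. "2.10" still compares as newer
# than "2.4.0" even though it is "smaller" as a plain string.
if torch.__version__ >= "2.4.0":
    print("using torch.library.custom_op / register_fake")
else:
    print("falling back to no-op wrappers")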

@mayank31398

Aah, yikes.
@ani300 I had started working on the same thing in #1145 😓

I'll let you handle this 😃

@ani300
Contributor Author

ani300 commented Aug 13, 2024

@tridao I've updated the code after testing it with SFTTrainer to make sure everything was getting called correctly and compiling without graph breaks. I've also added a check for the PyTorch version so this only activates if the APIs are available.

One question I have is whether it's OK to change the C++ PyTorch interface so the out/out_padded padding/unpadding can be done outside the custom op. Right now, PyTorch custom ops don't allow outputs to be aliases of other outputs, which this pair is when head_dim % 8 == 0. This means I have to clone one of the outputs to unalias them, which wastes memory bandwidth.
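
A small illustration of the aliasing constraint being described, with made-up shapes (not the PR's code): when head_dim is already a multiple of 8, the "unpadded" output is just a view of the padded one, and returning both from a custom op requires a clone.

import torch

head_dim = 64                                   # already a multiple of 8
out_padded = torch.randn(2, 128, 8, head_dim)   # hypothetical kernel output
out = out_padded[..., :head_dim]                # full-width slice: a view, i.e. an alias
# torch.library.custom_op forbids one output aliasing another, so before
# returning (out, out_padded) one of them has to be cloned first; that clone
# is the extra memory traffic mentioned above.
out = out.clone()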

@tridao
Contributor

tridao commented Aug 14, 2024

Yes let's change the C++ code to not pad the output, and have the Python part handle that.

@GLivshits

There's a different API that was used in previous PyTorch versions. It's more cumbersome to use, but let me think about how to use both APIs at the same time with a version selection. I can also do a no-op wrapper based on the version.

For older versions, you can use:

1. torch.library.define - just for the definition of the ops.
2. Then use something like this:

from functools import partial
from packaging import version
import torch

OLD_TORCH_VERSION = version.parse(torch.__version__).base_version < "2.4.0"
if OLD_TORCH_VERSION:
    _torch_custom_op_wrapper = partial(torch.library.impl, types="cuda")
    _torch_register_fake_wrapper = torch.library.impl_abstract
else:
    _torch_custom_op_wrapper = partial(torch.library.custom_op, device_types="cuda", mutates_args=())
    _torch_register_fake_wrapper = torch.library.register_fake

Here I assume that mutates_args is empty. If it is not, I don't currently know how to handle that (and I didn't think much about it) without copy-pasting lots of code.

@mayank31398

@GLivshits I don't think it can be handled in older versions of torch.

@mayank31398

@tridao @ani300 Is there any progress/update on this?
It's a pretty neat feature to have Flash Attention fully traceable end-to-end natively.

@ani300
Contributor Author

ani300 commented Aug 24, 2024

Hey, I'm almost done with the C++ portion; hopefully by Monday I can update the PR and rerun all the CI to make sure it's still good with all the changes. I'll also try the suggestions from @GLivshits to make it work with previous versions.

@ani300
Contributor Author

ani300 commented Aug 27, 2024

@GLivshits The way the code is currently written, the backward functions need to declare mutates_args, or we need to do an extra memory copy on the GPU, which might significantly impact performance.
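
For context, a hedged sketch of what declaring mutated arguments looks like under the 2.4 API (hypothetical op name and a stand-in body, not the PR's backward; assumes a CUDA device): if the kernel writes gradients into preallocated buffers, those parameters go into mutates_args so the compiler knows about the in-place writes.

import torch

@torch.library.custom_op("flash_attn_demo::bwd", mutates_args=("dq",), device_types="cuda")
def demo_bwd(dout: torch.Tensor, q: torch.Tensor, dq: torch.Tensor) -> None:
    # Stand-in for the real backward kernel, which writes into the
    # preallocated gradient buffer instead of allocating a new one.
    dq.copy_(dout * q)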

@ani300
Contributor Author

ani300 commented Aug 27, 2024

@tridao This is ready for final review and/or merging. I tested with the whole unit test suite for both PyTorch 2.3 and PyTorch 2.4 on an A100.

@raghukiran1224

@tridao Any update on this?

@anijain2305

Thanks @ani300 for taking this on. I had started working on this in #1209, but this PR is way further along, so I closed mine.

cc @zou3519 for custom ops API usage review, and @Chillee

_torch_register_fake_wrapper = register_fake_wrapper


@_torch_custom_op_wrapper("flashattn::_flash_attn_forward", mutates_args=(), device_types="cuda")


Nit - s/flashattn/flash_attn/

return out, softmax_lse, p, rng_state


try:


Perhaps use the same if condition

if torch.__version__ >= "2.4.0":

Comment on lines 66 to 68
def inner(*args, **kwargs):
return func(*args, **kwargs)
return inner


Is this equivalent to return func?

Comment on lines 74 to 76
def inner(*args, **kwargs):
return func(*args, **kwargs)
return inner


Ditto - return func

_torch_custom_op_wrapper = torch.library.custom_op
_torch_register_fake_wrapper = torch.library.register_fake
else:
def custom_op_wrapper(name, fn=None, /, *, mutates_args, device_types=None, schema=None):


Nit - a more descriptive name: noop_custom_op_wrapper.

And maybe a comment that we don't support versions < 2.4.

return inner
if fn is None:
return wrap
return wrap(fn)


return fn
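
Putting the review suggestions above together, the torch < 2.4 fallback ends up being roughly a pass-through decorator. This is a sketch using the noop_ naming suggested in the review, not the exact merged code:

# We don't support torch < 2.4 for torch.compile; these wrappers simply return
# the function unchanged so importing flash_attn still works on old versions.
def noop_custom_op_wrapper(name, fn=None, /, *, mutates_args, device_types=None, schema=None):
    def wrap(func):
        return func
    if fn is None:
        return wrap        # decorator-factory form: @noop_custom_op_wrapper("lib::op", mutates_args=())
    return fn              # direct form: noop_custom_op_wrapper("lib::op", fn, mutates_args=())

def noop_register_fake_wrapper(op_name, fn=None, /):
    def wrap(func):
        return func
    if fn is None:
        return wrap
    return fn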

@ani300
Contributor Author

ani300 commented Sep 11, 2024

@anijain2305 I just pushed all the updates/fixes you suggested. I'm running CI again to make sure it works for both 2.4 and 2.3, and will update the comments when it's done.

@anijain2305

Thanks @ani300

The custom ops changes look good to me. I hope @tridao can take a look at the other parts.

@ani300
Contributor Author

ani300 commented Sep 11, 2024

Tests all ran successfully again

tridao merged commit 83e41b3 into Dao-AILab:main on Sep 18, 2024
@tridao
Contributor

tridao commented Sep 18, 2024

Awesome, thanks!

@umarbutler

I just built the flash-attention main branch from source and sadly I'm getting the same error message during training with the Hugging Face Trainer with torch_compile=True:

.venv/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:663: UserWarning: Graph break due to unsupported builtin flash_attn_2_cuda.PyCapsule.fwd. This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind). If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround. If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use torch.compiler.allow_in_graph.
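
As an aside, a minimal way to check whether a given flash-attn build still graph-breaks (illustrative, assuming a CUDA device and flash-attn installed from a commit that includes this PR): compiling with fullgraph=True makes torch.compile raise on any graph break instead of silently splitting the graph.

import torch
from flash_attn import flash_attn_func

def attn(q, k, v):
    return flash_attn_func(q, k, v, causal=True)

# (batch, seqlen, nheads, headdim) in fp16 on GPU
q, k, v = (torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16) for _ in range(3))
compiled = torch.compile(attn, fullgraph=True)  # raises if anything graph-breaks
out = compiled(q, k, v)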

@umarbutler

It looks like I somehow ran python setup.py bdist_wheel and forgot I was creating a wheel (which I also ended up mistakenly deleting moments ago) 😅 Hopefully second time's the charm.

@ani300
Contributor Author

ani300 commented Sep 26, 2024

I was worried for a second, as I had just used it successfully with the Hugging Face Trainer yesterday 😥

@umarbutler

@ani300 I can confirm that it works! And it's much faster than PyTorch SDPA; it shaved off 50 training hours (400h -> 350h) 😆
